| The development of high-throughput sequencing technologies has made single-cell transcriptome sequencing possible and has made vast amounts of gene expression data available. Analysis of the gene expression data generated by sequencing is useful not only for inferring gene expression profiles and revealing differential expression between cells, but also for providing a strong scientific basis for studying areas such as embryonic development and cell differentiation. Cell classification in complex samples or tissues is a fundamental goal of scRNA-seq data analysis and a prerequisite for subsequent cell type identification, so cell classification based on scRNA-seq data has become a key research topic. At present, the task of classifying cells in single-cell transcriptome sequencing data is most commonly accomplished with clustering or classification algorithms. However, with the rapid development of single-cell technology and the closer integration of data science and the life sciences, more and more single-cell omics data are being integrated, and scRNA-seq datasets show complex and diverse characteristics, which bring numerous challenges to experimental research. On the one hand, scRNA-seq data contain a large amount of redundant information, and the gene expression matrix is excessively sparse because of technical limitations and human measurement error in single-cell RNA sequencing. Traditional single clustering algorithms do not handle this high sparsity well and cannot accurately describe the non-linear mapping between gene expression and cell type. On the other hand, the large amount of gene expression information contained in single-cell sequencing data and the thousands of gene dimensions give rise to the "curse of dimensionality" in data analysis. To address the problems of high data sparsity and high dimensionality in cell classification, this paper presents an in-depth study of cell classification for
scRNA-seq data based on autoencoders and related algorithms. The main work is as follows: (1) From an unsupervised perspective, we propose a cell classification model for scRNA-seq data based on an autoencoder and hierarchical clustering, called scAEHClust. Firstly, in order to retain genes that are valuable for experimental analysis, genes that have zero expression in all cells, as well as genes whose expression exceeds a defined threshold, are filtered out. In addition, to keep the expression of each gene on the same scale, the reads in the gene expression matrix are log-transformed. Secondly, to reduce the sparsity of scRNA-seq data, the missing values in the gene expression matrix are imputed using a three-layer autoencoder, which reduces the impact of dropout events on the experimental results. Thirdly, cell classification is achieved by constructing a binary tree of cells and changing the criterion for dividing cell groups used in the traditional hierarchical clustering algorithm. Finally, to present the classification result directly, the high-dimensional scRNA-seq data are mapped to two-dimensional space using the t-SNE visualization method and displayed as points of different colors. Comparative experiments show that the scAEHClust model achieves higher ARI and NMI than K-means, spectral clustering and Seurat, and can effectively perform the cell classification task. (2) From a semi-supervised perspective, we propose a cell classification model for scRNA-seq data based on the denoising autoencoder, called scSemiDAE. Firstly, in order to reduce the dimensionality of scRNA-seq data, feature extraction of gene expression profiles is performed using a denoising autoencoder, and a new classification objective function is constructed by balancing the importance of data reconstruction against that of the data structure in low-dimensional space. Secondly, the classification objective function is iteratively optimized using some of the real cell label information in the dataset to retain as much valid information in the data as
possible, while enhancing the similarities between cells of the same type and the differences between cells of different types in low-dimensional space, thus obtaining low-dimensional features that are easier to classify and reducing the dimensionality of scRNA-seq data. Finally, the K-means clustering algorithm is applied for classification. The experimental results show that, overall, the scSemiDAE model outperforms scSemiAE, netAE and other dimensionality reduction methods on the Zeisel, Deng, Baron Mouse, and Baron Human datasets in terms of ARI, NMI, and ACC, indicating that the scSemiDAE model performs better in dimensionality reduction and cell classification. (3) From a supervised perspective, we propose a stacked autoencoder-based cell classification model for scRNA-seq data, called scSAERLs. To dig deeper into cell type-specific gene expression profiles, the scSAERLs model is first pre-trained greedily, layer by layer, to mine the non-linear relationship between gene expression and cell type and to obtain the initial parameters of the model. The parameters are then fine-tuned, and the network output is extracted as a feature representation of the scRNA-seq data, which captures the specific gene expression features of the different cell subpopulations in the dataset and reduces the dimensionality of the scRNA-seq data. Finally, a softmax classifier is applied to the extracted feature representations to predict cell labels. Comparative experiments show that, on the Baron Mouse, Baron Human, Segerstolpe, and Zheng sorted datasets, the scSAERLs model achieves the best overall accuracy and F1-score among the compared methods, which include five advanced cell classification methods such as scmap-cluster and CaSTLe, and that it is able to identify new cell types. |
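
The preprocessing step described for scAEHClust (gene filtering followed by a log transform) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the `max_threshold` parameter, the rule of comparing each gene's maximum count against that threshold, and the `log2(x + 1)` form of the transform are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def preprocess(counts, max_threshold):
    """Filter genes and log-transform a cells x genes count matrix.

    Assumptions (not stated in the abstract): a gene is dropped when its
    maximum count exceeds `max_threshold`, and the transform is log2(x + 1).
    """
    counts = np.asarray(counts, dtype=float)

    # Keep genes that are expressed in at least one cell.
    expressed = counts.sum(axis=0) > 0

    # Hypothetical upper-bound rule: drop genes with extreme counts.
    below_cap = counts.max(axis=0) <= max_threshold

    keep = expressed & below_cap

    # Adding 1 keeps zero counts at zero after the log transform.
    logged = np.log2(counts[:, keep] + 1.0)
    return logged, keep


# Toy matrix: 3 cells x 4 genes.
counts = [
    [0, 3, 0, 1000],
    [0, 1, 7, 2000],
    [0, 0, 15, 1500],
]
logged, keep = preprocess(counts, max_threshold=100)
print(keep)    # boolean mask over the original genes
print(logged)  # filtered, log-transformed matrix
```

In this toy matrix the first gene is zero in every cell and the last gene exceeds the assumed threshold, so only the two middle genes remain. The zeros that survive filtering are the entries that the autoencoder imputation step would then address.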