| Since the start of the 21st century, the completion of the Human Genome Project has marked an unprecedented step forward in scientists' exploration of the mysteries of human biology. The cell is the basic structural and functional unit of an organism. As early as 2017, world-renowned scientists jointly proposed the “Human Cell Atlas Project”, dedicated to the systematic description of all human cells. The project aims to define new cell types by sequencing all cells in the human body and to depict their spatial organization and the subtle relationships among them, so as to give a more comprehensive understanding of disease pathogenesis and provide new research directions for diagnosis and treatment. It is also an important milestone in the industrialization of single-cell research. Single-cell RNA sequencing (scRNA-seq) technology helps us decode life at higher resolution and in its spatiotemporal context, and accurately reflects the heterogeneity between cells. The human body is an extremely complex system composed of many types of cells, and scRNA-seq technology allows human diseases to be studied more clearly. However, scRNA-seq data are large in volume, high-dimensional, and noisy, which makes it difficult for traditional machine learning algorithms to process and analyze them effectively. Therefore, developing efficient machine learning approaches for processing and analyzing scRNA-seq data is of great significance for understanding the pathogenesis and treatment of human diseases. In this thesis, we conduct an in-depth study of scRNA-seq data. The main research contents and contributions are as follows: (i) The continuous development of RNA sequencing technology provides new insights into biological systems. In particular, scRNA-seq technology represents a major breakthrough in this field. It provides a powerful tool to determine the precise expression 
patterns of thousands of single cells and to decipher cell heterogeneity and cell subpopulations. However, because of various sources of technical noise, such as “dropout” events (i.e., excessive zero counts), the analysis of scRNA-seq data remains challenging. To address this challenge, this thesis proposes a novel method based on collaborative matrix factorization, called CMF-Impute, which considers the relationships between cells and genes to estimate the missing entries of a given scRNA-seq expression matrix. We tested CMF-Impute and compared it with five other recent methods on six popular real-world scRNA-seq datasets of different sizes and on three simulated datasets. In these experiments, CMF-Impute imputed more accurately than the comparison methods. Finally, we demonstrate the ability of CMF-Impute to reconstruct cell-to-cell and gene-to-gene correlations and to infer cell lineage trajectories. (ii) scRNA-seq technology is a revolutionary breakthrough that measures the precise gene expression of single cells and deciphers cell heterogeneity and subpopulations. However, owing to technical limitations, scRNA-seq data are much noisier than bulk RNA-seq data. When applied to such data, traditional dimensionality reduction and visualization methods become ineffective. Herein, this thesis proposes an improved variational autoencoder method (called scIVA) for dimensionality reduction and visualization of scRNA-seq data. scIVA not only combines a variational autoencoder with a Gaussian mixture model, but also explicitly models “dropout” events by introducing a zero-inflated (ZI) layer, yielding a low-dimensional representation of the variation in scRNA-seq data. A benchmark comparison on 10 scRNA-seq datasets shows that scIVA outperforms five state-of-the-art methods. In addition, scIVA accurately captures the expression dynamics of human preimplantation 
embryos. (iii) A main challenge in the analysis of scRNA-seq data is the growing size of datasets. In large datasets, identifying cell populations is very difficult because many existing scRNA-seq clustering methods do not scale. In addition, the batch effect (e.g., systematic gene expression differences between batches), which arises for various reasons, is also an urgent problem to be solved. If the batch effect is not removed, it complicates downstream analysis and can lead to misinterpretation of the results. Thus, this thesis proposes a deep clustering method for scRNA-seq data based on graph embedding (called scGEDC). To constrain the manipulation of the feature space and preserve the local structure of the data-generating distribution, an under-complete autoencoder is applied. By integrating the clustering loss with the autoencoder's reconstruction loss, scGEDC jointly optimizes cluster label assignment and learns features that are suitable for clustering while preserving local structure. In addition, by introducing a graph loss, scGEDC pulls the features of similar cells as close together as possible. The experimental results show that scGEDC is a useful tool for a series of basic analysis tasks, including batch correction, visualization, and clustering, and that it outperforms several benchmark algorithms on each task. |
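
The abstract describes the scGEDC objective only in words: a reconstruction loss, a clustering loss, and a graph loss that are optimized jointly. As a rough illustration of how three such terms might be combined, the NumPy sketch below evaluates one possible joint objective. The specific forms used here (mean squared reconstruction error, a DEC-style KL clustering loss with Student's t soft assignments, and a neighbor-weighted squared-distance graph penalty), along with the names `joint_loss`, `gamma`, and `lam`, are assumptions chosen for illustration and are not taken from the thesis.

```python
import numpy as np

def soft_assign(z, centers):
    """Student's t soft assignment of each embedding to each cluster center."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target distribution (DEC-style; an assumption, not from the thesis)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def joint_loss(x, x_rec, z, centers, adj, gamma=1.0, lam=0.1):
    """Hypothetical joint objective: reconstruction + gamma * clustering + lam * graph."""
    # Reconstruction term: mean squared error between input and decoder output.
    rec = np.mean((x - x_rec) ** 2)
    # Clustering term: KL(P || Q) averaged over cells.
    q = soft_assign(z, centers)
    p = target_distribution(q)
    clu = np.sum(p * np.log(p / q)) / len(z)
    # Graph term: average squared embedding distance over connected cell pairs,
    # which is small when similar (neighboring) cells have close embeddings.
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    graph = np.sum(adj * d2) / max(adj.sum(), 1.0)
    return rec + gamma * clu + lam * graph, (rec, clu, graph)

# Toy usage with random data: 6 cells, 20 genes, 2-D embedding, 2 clusters.
rng = np.random.default_rng(0)
x = rng.poisson(1.0, size=(6, 20)).astype(float)       # toy count matrix
x_rec = x + rng.normal(scale=0.1, size=x.shape)        # stand-in for decoder output
z = rng.normal(size=(6, 2))                            # stand-in for encoder output
centers = rng.normal(size=(2, 2))                      # cluster centers
adj = np.zeros((6, 6))
adj[0, 1] = adj[1, 0] = 1.0                            # cells 0 and 1 are neighbors
total, parts = joint_loss(x, x_rec, z, centers, adj)
print(total, parts)
```

In a real model the encoder, decoder, and cluster centers would be trained by gradient descent on such an objective; here the arrays are random stand-ins, so the sketch only shows how the terms are assembled and weighted.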