| The development and application of high-throughput sequencing technology have produced different types of transcriptome sequencing data. Single-cell RNA sequencing data and spatial transcriptome sequencing data, two typical types, have drawn extensive attention from researchers. Both can be used to study gene transcription and regulation at the molecular level in cells at various developmental stages and under various growth conditions, which is essential to the prevention, diagnosis, and treatment of many complex diseases. However, owing to technical limitations, single-cell and spatial transcriptome sequencing data contain a large number of missing values, known as ‘dropout’ events. These missing values complicate data analysis and reduce the accuracy of its results. Meanwhile, existing imputation methods have limitations such as insufficient mining of the information within the data, over-smoothing, the introduction of additional errors, and limited applicability. To fully mine the valuable information in these data, this dissertation focuses on the missing-value problem in single-cell and spatial transcriptome sequencing data and on the limitations of existing imputation methods, and proposes new imputation methods that recover the data structure and biological characteristics. The main research contents of this dissertation are as follows: First, in light of the over-smoothing and additional errors introduced by existing imputation methods, this dissertation proposes a block imputation method based on cell-level and gene-level information to address missing values in single-cell RNA sequencing data. The method constructs a statistical model that identifies dropout events from gene expression levels and their variation, then uses expression values unaffected by dropout in similar cells to impute the missing values. Results on real datasets show that this method successfully preserves the heterogeneity of gene expression across 
cells, avoids over-smoothing of the data, and minimizes the introduction of additional errors. Moreover, it outperforms other imputation methods in improving the accuracy of cell clustering, visualization, and differential gene expression analysis. Second, because existing imputation methods do not consider the dynamics of the transcriptome, this dissertation proposes a multi-dimensional imputation method based on transcriptome dynamic information to address missing values in single-cell RNA sequencing data. The method identifies local cell neighbors and specific gene co-expression networks based on the pseudo-time of cells, leveraging cell-level, gene-level, and transcriptome dynamic information to recover single-cell RNA sequencing data. Results on real data show that the method recovers the distribution and structure of gene expression, improves the accuracy of trajectory inference, differential expression analysis, cell clustering, and cell type identification, and applies to data from different sequencing platforms. Third, to address the missing-value problem in spatial transcriptome sequencing data, this dissertation proposes an imputation method that integrates single-cell transcriptome information. The method first constructs separate shared nearest neighbor graphs for cells and for spots, based on shared nearest neighbor theory. Next, it builds a graph-regularized joint non-negative matrix factorization that integrates the two types of data into the same low-rank space. Finally, it uses the transcriptional information from the nearest single-cell neighbors of each spot to impute missing values. Results on real datasets show that the method accurately recovers gene expression levels and improves downstream analysis of 10x Visium spatial transcriptome sequencing data. |
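
As a rough illustration of two steps in the third method (building a shared nearest neighbor graph, then imputing a spot from its nearest single-cell neighbors), the sketch below may help. It is not the dissertation's implementation: the function names, the Jaccard-style edge weight, the treatment of zeros as missing values, and the use of a plain mean over neighbors are all assumptions made for illustration, and the joint non-negative matrix factorization that would produce the shared low-rank coordinates is not shown.

```python
from math import dist


def knn(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_weights(points, k):
    """Shared nearest neighbor weights (assumed form: Jaccard overlap of kNN sets)."""
    nbrs = knn(points, k)
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = len(nbrs[i] | nbrs[j])
            shared = len(nbrs[i] & nbrs[j])
            w = shared / union if union else 0.0
            weights[i][j] = weights[j][i] = w
    return weights


def impute_spot(spot_expr, spot_embed, cell_exprs, cell_embeds, k):
    """Fill zero entries of one spot with the mean over its k nearest cells.

    spot_embed and cell_embeds are hypothetical coordinates in a shared
    low-rank space; zeros are assumed to mark missing values.
    """
    nearest = sorted(
        range(len(cell_embeds)),
        key=lambda c: dist(spot_embed, cell_embeds[c]),
    )[:k]
    imputed = list(spot_expr)
    for g, value in enumerate(spot_expr):
        if value == 0:
            imputed[g] = sum(cell_exprs[c][g] for c in nearest) / len(nearest)
    return imputed
```

In this sketch, observed (non-zero) values in a spot are left untouched and only zero entries are replaced, which is one simple way to limit the additional error that imputation can introduce.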