Research on Low-Rank Matrix Decomposition Model for Cell Type Recognition

Posted on: 2024-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: N N Zhang
GTID: 2530306923488834
Subject: Electronic information
Abstract/Summary:
In recent years, single-cell RNA sequencing (scRNA-seq) technology has been increasingly favored by researchers. Its continuous development gives researchers new ways to address biological problems at the single-cell level. Analysis of scRNA-seq data lets researchers study human diseases more clearly, which is of great significance for understanding their pathogenesis and treatment. However, gene expression measurements generated by scRNA-seq often contain many zero values. Some of these zeros correspond to genes that were never expressed and are known as "true zeros". Others are false zeros: the genes are actually expressed but were not amplified or detected, and these are commonly referred to as "dropout" events. The high proportion of dropout events makes scRNA-seq data sparse and noisy. Effectively and accurately identifying cell types from such noisy data therefore remains a challenge. This thesis studies scRNA-seq data analysis from two aspects: data imputation and data clustering. The specific research content is as follows:

(1) To address the issue that dropout events in scRNA-seq data can reduce the accuracy of downstream analysis methods such as cell clustering, the author proposes a non-negative matrix factorization based scRNA-seq data imputation algorithm (scNMF-Impute). Its purpose is to impute the missing values of the gene expression matrix so as to improve the performance of existing downstream analysis methods. scNMF-Impute explicitly models the missing values caused by dropout events as a non-negative sparse matrix (the missing value matrix) while taking into account the interrelationships between genes, so that it can effectively locate missing values and accurately restore them. Finally, scNMF-Impute is applied to real scRNA-seq datasets to impute the missing values caused by dropout events and restore the true gene expression. Experimental results show that, whether the traditional k-means method or the SC3 method is used for clustering, the best results are always obtained on the datasets processed by scNMF-Impute.

(2) To solve the problem that most existing scRNA-seq clustering methods neglect the local geometric structure of the data, the author proposes a single-cell type recognition method based on similarity and manifold graph regularization constraints (SLRRSC). This method introduces manifold graph regularization and the similarity information between cells into the low-rank representation model. As a result, the low-rank representation matrix can accurately describe the spatial relationships of scRNA-seq data containing a large amount of noise, and the model is better able to capture the internal structure of the data. Finally, SLRRSC is applied to cluster cell populations in scRNA-seq datasets to learn the heterogeneity between cells and identify cell types. Experimental results indicate that SLRRSC obtained the highest average normalized mutual information (NMI) and adjusted Rand index (ARI) among all compared methods (NMI = 0.8962, ARI = 0.8836).

(3) Traditional scRNA-seq clustering methods based on low-rank representation models usually select the original scRNA-seq data matrix, which contains a large amount of noise, as the dictionary, which makes it difficult to guarantee the optimal clustering effect. To address this, the author proposes a single-cell type recognition method based on dictionary learning (DLNLRR). First, the method uses a linear combination of the raw scRNA-seq data as the dictionary instead of a fixed dictionary. The dictionary can therefore be updated during optimization, so that dictionary learning and low-rank representation learning are carried out together. Second, the method can group samples directly according to the maximum value of each column vector of the resulting low-rank representation matrix, achieving subspace clustering without relying on spectral clustering algorithms. Finally, DLNLRR is applied to cluster cell populations in scRNA-seq datasets to identify cell types. Experimental results indicate that DLNLRR achieved ARI values greater than 50% on all experimental datasets, with the highest average.

The methods proposed in this thesis have been applied to scRNA-seq data, and experimental results show that they obtain more accurate clustering results and effectively address the problem of cell type recognition.
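
The ARI values reported in the abstract are easier to interpret with the formula in hand. The following is a minimal sketch, assuming the standard pair-counting definition of the adjusted Rand index; the abstract does not say which implementation was used. The function name `adjusted_rand_index` and the toy labels are illustrative and are not taken from the thesis.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        # Fewer than two cells: there are no pairs to disagree on.
        return 1.0

    # Contingency counts: cells per (reference type, predicted cluster) pair.
    pair_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of cell pairs grouped together in both labelings, and in each one.
    sum_both = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (sum_both - expected) / (max_index - expected)


# Toy example: six cells, two reference types, cluster ids chosen arbitrarily.
reference = ["T", "T", "T", "B", "B", "B"]
predicted = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(reference, predicted))  # 1.0
```

A score of 1 means the predicted clusters match the reference cell types exactly, up to relabeling, while values near 0 correspond to chance-level agreement. On that scale, an average ARI of 0.8836 indicates close but not perfect agreement with the reference labels.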
Keywords/Search Tags: Single-cell sequencing, Missing value imputation, Cell type recognition, Manifold graph regularization, Dictionary learning