
Research On Single Cell Unsupervised Clustering Based On Matrix Decomposition And Graph

Posted on: 2021-01-18, Degree: Master, Type: Thesis
Country: China, Candidate: S G Bu, Full Text: PDF
GTID: 2370330605966469, Subject: Software engineering
Abstract/Summary:
Transcriptome sequencing data contain rich biological patterns that reflect the gene expression of living organisms at a specific time or place. Typical applications include research on embryonic development and differentiation, diagnosis and treatment of cancer and other major diseases, drug development, and gene function discovery. Cell clustering and typing based on transcriptome data underlies all of these research questions. However, transcriptome data are high-dimensional, small-sample, unlabeled, and noisy, which poses challenges for conventional data mining and machine learning. Based on single-cell transcriptome data, this thesis systematically studies the performance of three clustering models on transcriptome data. The main innovations are as follows:

(1) An effective data preprocessing method is designed according to the characteristics of single-cell transcriptome data; it better eliminates the differences in scale between the expression values of different genes. In addition, a heatmap visualization method for transcriptome datasets is designed that uses gold-standard labels to guide the reordering of cells and genes, so that the dense clusters within each class and the sparse cluster structure can be observed clearly. This helps in analyzing differences within a dataset and in finding marker genes.

(2) A shared-neighbor and subgraph-partitioning method based on negative correlation constraints is designed. The method has two stages: graph construction and group merging. In graph construction, the traditional shared-neighbor similarity measure is improved by introducing neighbor distance ranking, so that the similarity between cells is characterized more effectively. Then, for each cell vertex, a quasi-group is searched for in the graph model, that is, the set of all vertices connected to that cell node. The quasi-groups are repeatedly pruned according to a specified parameter r, which yields a large number of groups. Groups with high overlap are then merged according to a specified parameter m, giving the final cluster groups. Finally, a cell that appears in several clusters at the same time is assigned to the group for which the sum of the weights of its connecting edges is largest, that is, the cluster most closely connected to that cell. We also give three improvements and verify the effectiveness of the improved method on extensive datasets. Experimental results show that the new method improves markedly on the original.

(3) An eigensubspace clustering method based on matrix decomposition is designed. The method adapts traditional matrix decomposition models so that they can perform clustering. For the coefficient matrix P obtained from non-negative matrix factorization (NMF), we take the rows to represent cluster categories and the columns to represent cell samples, so that P_ij reflects the degree of association between the j-th cell and the i-th category. On this basis, we apply NMF to the gene expression matrix with a specified number of categories k and read the cluster grouping of the cells from the coefficient matrix P. Principal component analysis (PCA) is first used as a dimension reduction tool on the gene expression matrix; we believe that dimension reduction helps filter out the technical noise introduced by transcriptome sequencing while reducing the amount of computation. k-means with preset seeds is then run in the new feature space to obtain the final cell grouping. For singular value decomposition (SVD), we focus on the first k rows of the sub-matrix V, which are in theory closely related to the cell grouping, and we verify this hypothesis on an artificial dataset. k-means is then run in the new feature space V[1:k,] to obtain the final cell grouping. We tested these three methods on 20 published single-cell transcriptome datasets. Experimental results show that SVD and its improved variants outperform PCA and are on par with NMF.

(4) A community discovery method based on graph denoising is designed. We systematically studied how well Euclidean distance, Gaussian distance, the Pearson correlation coefficient, and the Spearman correlation coefficient measure cell similarity on single-cell transcriptome datasets. The results show that Spearman correlation is better suited to this task. For the noisy cell similarity network, we introduce the diffusion theory proposed by Jure Leskovec's team at Stanford University to improve the signal-to-noise ratio of the network, and use the enhanced network for cell grouping. Cells are grouped with the Louvain algorithm, which is based on modularity maximization. The results show that the GSE method achieves outstanding clustering performance, comparable to state-of-the-art methods.
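
The attribution rule for cells that appear in several clusters (the shared-neighbor method above) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the representation of groups as sets and of edge weights as a dictionary, and the tie-breaking behavior (first candidate wins) are all assumptions made for the example.

```python
def connection_strength(cell, group, weights):
    """Sum of edge weights between `cell` and the other members of `group`.

    `weights` maps frozenset({u, v}) -> edge weight (hypothetical layout);
    missing edges count as weight 0.
    """
    return sum(
        weights.get(frozenset((cell, other)), 0.0)
        for other in group
        if other != cell
    )


def resolve_overlaps(groups, weights):
    """Assign every cell to exactly one group.

    A cell that belongs to several groups goes to the one with the largest
    sum of connecting edge weights. Ties go to the earliest group (assumption).
    """
    membership = {}
    for g_idx, group in enumerate(groups):
        for cell in group:
            membership.setdefault(cell, []).append(g_idx)

    return {
        cell: max(
            candidates,
            key=lambda g_idx: connection_strength(cell, groups[g_idx], weights),
        )
        for cell, candidates in membership.items()
    }


# Toy example: cell "c" appears in both groups.
groups = [{"a", "b", "c"}, {"c", "d", "e"}]
weights = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.3,
    frozenset(("c", "d")): 0.8,
    frozenset(("c", "e")): 0.7,
    frozenset(("d", "e")): 0.6,
}
assignment = resolve_overlaps(groups, weights)
print(assignment["c"])
```

In this toy graph, "c" is connected to the first group with total weight 0.5 and to the second with total weight 1.5, so it is attributed to the second group.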
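
Reading a cell grouping off the NMF coefficient matrix P can be illustrated with a short sketch. The abstract states only that rows of P are categories, columns are cells, and P_ij measures their association; assigning each cell to the row with the largest entry is one natural reading and is an assumption here, as are the function name and the toy matrix.

```python
def assign_clusters(P):
    """For each column (cell) j of P, return the row index i with the largest P[i][j].

    P is a k x n list of lists: k categories (rows), n cells (columns).
    Picking the arg-max row is an assumed assignment rule.
    """
    k = len(P)
    n_cells = len(P[0])
    labels = []
    for j in range(n_cells):
        column = [P[i][j] for i in range(k)]
        labels.append(max(range(k), key=lambda i: column[i]))
    return labels


# Hypothetical coefficient matrix: k = 2 categories, 4 cells.
P = [
    [0.9, 0.1, 0.7, 0.2],
    [0.1, 0.8, 0.3, 0.6],
]
labels = assign_clusters(P)
print(labels)
```

With this rule the number of clusters is fixed by the factorization rank k, so no separate clustering step is needed after NMF.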
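
The Spearman correlation used above as a cell similarity measure is the Pearson correlation of the rank-transformed expression values. The sketch below is a plain-Python illustration with average ranks for ties; the function names are hypothetical, and returning 0.0 for a constant vector is a convention chosen for the example, not something stated in the thesis.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values equal to the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention for constant vectors
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Two hypothetical cells measured on the same four genes.
cell_a = [1.0, 2.0, 3.0, 4.0]
cell_b = [10.0, 100.0, 1000.0, 10000.0]
print(spearman(cell_a, cell_b))
```

Because only ranks enter the calculation, the two toy cells are perfectly correlated even though their expression values differ by orders of magnitude, which is one reason a rank-based measure can be robust on noisy expression data.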
Keywords/Search Tags: Single-cell transcriptome, Clustering, Graph, Matrix decomposition, Network diffusion