| The invention of single-cell transcriptome sequencing has raised the study of cells from the level of cell populations, as measured by bulk sequencing, to the level of single cells. As the technology matures, the cost of sequencing a single cell has dropped significantly, and more and more large-scale datasets are being generated. Larger-scale data allow researchers to delve deeper into cell-to-cell differences and uncover unknown cell types. Clustering, the most commonly used method for cell-type identification in single-cell transcriptome research, is the basis for subsequent in-depth analysis of single-cell data. However, as data scale grows, existing clustering algorithms suffer from problems such as low efficiency and failure to run properly. Therefore, it is necessary to develop an efficient clustering algorithm for large-scale single-cell transcriptome data to cope with the rapidly growing data scale. In this paper, we propose LScEClust, an efficient clustering algorithm for large-scale single-cell transcriptome data. The core idea is to reduce the number of cells that need to be clustered by sampling, to greatly reduce graph construction time by using approximate rather than exact nearest neighbor search, and finally to obtain clustering results for all cells through efficient label assignment. LScEClust copes with large-scale data through the steps of data preprocessing, nearest neighbor sampling, dimensionality reduction, clustering, and assigning labels to the cells that did not participate in clustering. In particular, the novel nearest neighbor sampling algorithm proposed in this paper constructs a nearest neighbor graph to select high-quality cells, which greatly reduces the amount of data that subsequent steps must process without affecting clustering accuracy. Nearest neighbor sampling is also scalable, so it can work efficiently with even larger datasets in the future. In addition, the 
validity and graph construction efficiency of nearest neighbor sampling are verified and analyzed in this paper. The results show that nearest neighbor sampling performs better than traditional sampling methods, and that building the graph with approximate nearest neighbors is much more efficient than traditional exact construction. To verify the efficiency and effectiveness of the algorithm, LScEClust was compared experimentally with the commonly used single-cell clustering algorithms SC3, CIDR, SNN-Cliq, Seurat, dropClust and SHARP on single-cell transcriptome data at different scales: simulated data, normal-scale data, large-scale data (more than 40,000 cells) and ultra-large-scale data (more than 100,000 cells). On simulated single-cell data ranging from 500 to 100,000 cells, LScEClust exhibited the shortest running time and good accuracy, and its running time increased slowly as the data size increased. On real data, the normal-scale datasets contain fewer cells; although LScEClust runs fastest on them, the running times of the algorithms with better clustering performance are within an acceptable range, so the advantage of LScEClust is not obvious at this scale. On large-scale and ultra-large-scale data, LScEClust has accuracy similar to that of the other algorithms but a far lower running time. On the 100,000-cell dataset, it runs 6 times faster than the fastest of the other algorithms. LScEClust is also one of the few clustering algorithms that can run on data at the scale of 690,000 cells. Furthermore, benefiting from the scalability of nearest neighbor sampling, the running time of the partitioned LScEClust algorithm can be reduced further. All experimental results show that LScEClust is the fastest of the compared clustering algorithms and achieves good clustering performance. |
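The core idea described above (sample, cluster the sample, then assign labels to the remaining cells) can be illustrated with a minimal Python sketch. It is not the LScEClust implementation. The density-based sampling rule, the toy connected-components clustering, and all function names and parameters (`sample_cells`, `cluster_components`, `assign_labels`, `k`, `n_keep`, `eps`) are assumptions made for illustration. Brute-force neighbor search stands in for the approximate search used in the paper, and preprocessing and dimensionality reduction are omitted.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_mean_distance(points, k):
    """Mean distance from each point to its k nearest neighbours.

    Brute force for clarity; an approximate search would replace this
    on large data.
    """
    scores = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores


def sample_cells(points, k, n_keep):
    """Hypothetical sampling rule: keep the n_keep cells whose
    neighbourhoods are densest (smallest mean kNN distance)."""
    scores = knn_mean_distance(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


def cluster_components(points, eps):
    """Toy clustering: connected components of the graph that links
    points closer than eps. Any clustering method could be used here."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and dist(points[i], points[j]) < eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def assign_labels(points, sampled_idx, sampled_labels):
    """Give every cell the label of its nearest sampled cell."""
    labels = []
    for p in points:
        nearest = min(
            range(len(sampled_idx)),
            key=lambda s: dist(p, points[sampled_idx[s]]),
        )
        labels.append(sampled_labels[nearest])
    return labels


def cluster_with_sampling(points, k, n_keep, eps):
    """Sample, cluster the sample only, then label all cells."""
    sampled_idx = sample_cells(points, k, n_keep)
    sampled_points = [points[i] for i in sampled_idx]
    sampled_labels = cluster_components(sampled_points, eps)
    return assign_labels(points, sampled_idx, sampled_labels)


# Two well-separated 5 x 6 grids of toy "cells" in two dimensions.
blob_a = [(float(x), float(y)) for x in range(5) for y in range(6)]
blob_b = [(x + 100.0, y + 100.0) for x, y in blob_a]
cells = blob_a + blob_b

labels = cluster_with_sampling(cells, k=4, n_keep=24, eps=1.5)
```

In this toy run only 24 of the 60 cells are clustered directly; the remaining cells receive the label of their nearest sampled cell, so each grid ends up with a single label. The expensive clustering step therefore scales with the sample size rather than with the full dataset.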