| The cell is the basic unit of the structure and function of an organism and carries out its various specific functions. Single-cell RNA sequencing enables researchers to study heterogeneity between individual cells and to discover and define new cell subtypes from a transcriptome perspective. With the continuous development of sequencing technology, single-cell sequencing offers high depth and high throughput and can accurately measure the relevant characteristics of a single cell. Cluster analysis is an effective method for studying single-cell heterogeneity. Many clustering algorithms currently exist for single-cell RNA sequencing data, but they share a common problem: because of amplification failure at the reverse transcription step of single-cell RNA sequencing experiments, a large number of genes cannot be detected, which is often called the dropout phenomenon. Dropout generally appears as an excess of zero and near-zero values in the data set; it is the main reason for the unsatisfactory performance of current clustering algorithms and a major challenge for single-cell clustering. In addition, most existing clustering strategies are based on a single algorithm, which makes it difficult to generalize well to complex single-cell data. To address these problems, this paper proposes two algorithms. The main work is as follows: (1) A dynamic interpolation algorithm based on Hsim distance and cosine distance is proposed to solve the dropout problem in single-cell sequencing data. We first calculate a threshold to determine the dropout candidate set. We then estimate the relationship between dropout probability and expression level by fitting a decreasing logistic function to the data with nonlinear least-squares regression. Next, we perform dynamic interpolation on the missing genes in the dropout candidate set to obtain gene expression values closer to the true values. Finally, the Hsim distance and cosine distance functions are used to construct the cell difference matrix. (2) Based on the cell difference matrix obtained after dropout interpolation, we propose a weighted multi-distance consistent clustering algorithm. Building on the traditional consistent clustering algorithm, we fully consider how different values of k in k-means clustering and different distance measures affect the clustering result. First, k-means clustering is run with different values of k, and each clustering result yields a consistency matrix. Then, three distance measures are used to calculate distance matrices, and hierarchical clustering is performed on each. A comprehensive matrix is constructed by grading the hierarchical clustering results. Finally, hierarchical clustering is performed on the comprehensive matrix to obtain the final cell clustering result. To verify the effectiveness of the two algorithms, we conducted comparative experiments on six single-cell data sets. With regard to dropout, the experimental results show that the proposed dynamic interpolation algorithm based on Hsim distance and cosine distance can effectively fill in the missing genes and improve clustering accuracy. With regard to clustering, the comparative results show that the proposed weighted multi-distance consistent clustering algorithm performs well on all six data sets and improves on the performance of the traditional cell consistent clustering algorithm to a certain extent. In addition, the algorithm improves running speed while maintaining clustering accuracy, and it has a degree of generality. |
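
The consistency-matrix step in (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `consensus_matrix` and `average_linkage` are hypothetical, the three label vectors are made-up stand-ins for k-means runs with k = 2, 3, 4, the uniform weights are an assumption, and average linkage is assumed for the final hierarchical clustering. The Hsim and cosine distance computations and the grading of the three distance measures are not reproduced here.

```python
from itertools import combinations


def consensus_matrix(partitions, weights=None):
    """Weighted fraction of partitions in which each pair of cells co-clusters."""
    n = len(partitions[0])
    if weights is None:
        weights = [1.0] * len(partitions)  # assumption: equal weight per run
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w
    return [[value / total for value in row] for row in matrix]


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pair_sum = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = pair_sum / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair (a < b always holds)
        del clusters[b]
    labels = [0] * len(dist)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Made-up label vectors for six cells, standing in for k-means with k = 2, 3, 4.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 1, 2, 3, 3],
]
consensus = consensus_matrix(partitions)
# Cells that often co-cluster should be close, so use 1 - consensus as distance.
distance = [[1.0 - value for value in row] for row in consensus]
final_labels = average_linkage(distance, n_clusters=2)
print(final_labels)
```

Pairs of cells that share a cluster in every run get a consistency value of 1, so the final hierarchical step groups them first. In practice the naive cubic-time linkage above would be replaced by a library routine.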