| With advances in biological research and the development of single-cell RNA sequencing technology, clustering sample data at the cellular level has become feasible. Clustering single-cell measurements into biologically relevant phenotypes is now an important task in biological research. However, the data obtained by RNA sequencing are very high-dimensional, and the limitations of the sequencing technology mean that only a limited amount of gene expression is measured at a time, so a large amount of feature information is missing. The missing values are usually filled with ‘0’ to keep the data complete, which leaves a large proportion of zeros in the final data. The high dimensionality and the large number of zeros make it difficult for traditional clustering algorithms to achieve good results on such data and place higher demands on the clustering algorithm. Recently, the MoE-Sim-VAE (Mixture-of-Experts Similarity Variational Autoencoder) clustering model, which is based on data similarity, has received much attention for its flexibility in adapting to complex data and its excellent clustering accuracy. This thesis improves and optimises this model with respect to the characteristics of biological data and some limitations of the model, in two main respects. (1) To address the high dimensionality and the high proportion of zeros in sequenced cell data, this thesis proposes a new ‘Maxpooling’ mechanism for processing biological data, based on the way Partial-VAE handles missing feature information. The mechanism is applied in the data-processing part of MoE-Sim-VAE, where it removes the zero entries and reduces the dimensionality of the clustered sample data to some degree. In experiments on simulated datasets with a higher proportion of zeros (that is, more missing feature information filled with ‘0’), it achieved better clustering results than MoE-Sim-VAE and improved the stability and robustness of the model. (2) To address the poor feature extraction on biological data and to further improve the clustering accuracy of the model, the data are first reduced in dimensionality by random-forest feature selection. The KNN algorithm is then used to find each sample's nearest neighbours in the selected features and concatenate them with the sample, so that the MoE-Sim-VAE model can extract the spatial structure features of the samples. In addition, the original model sets the same variance for all samples when constructing the Gaussian mixture distribution of the latent representations; to remove this limitation, a fully connected layer is used to fit the variance parameters of the distribution, allowing the model to optimise the Gaussian mixture distribution of the latent representations adaptively during training. Finally, three optimisation approaches are proposed for the improved model, and the combination of these methods achieves better clustering accuracy than MoE-Sim-VAE. |
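
The KNN fusion step in (2) can be illustrated with a minimal sketch. It rests on one reading of the abstract, namely that each sample's feature vector is extended with the feature vectors of its k nearest neighbours under Euclidean distance. The function name `knn_concat`, the distance metric, and the toy data are assumptions made for illustration and are not taken from the thesis. Pure Python is used to keep the example self-contained; a real pipeline would more likely use a numerical library.

```python
import math


def knn_concat(samples, k):
    """Append the features of each sample's k nearest neighbours to its own.

    Hypothetical helper: Euclidean distance and plain concatenation are
    assumptions, not details confirmed by the thesis.
    """
    if not 0 < k < len(samples):
        raise ValueError("k must be between 1 and len(samples) - 1")
    fused = []
    for i, x in enumerate(samples):
        # Rank every other sample by its distance to x (ties keep input order).
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, samples[j]))[:k]
        row = list(x)
        for j in nearest:
            row.extend(samples[j])  # neighbour features follow the sample's own
        fused.append(row)
    return fused


# Toy data: two well-separated pairs of "cells" with two features each.
cells = [
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
]
fused = knn_concat(cells, k=1)
```

With `k=1`, every row of `fused` has four values: the sample's own two features followed by those of its closest neighbour. The input width therefore grows by a factor of `k + 1`, which is one reason to run feature selection before this step.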