Research On Clustering Algorithm Based On Single-cell Gene Expression Data

Posted on: 2021-05-07; Degree: Master; Type: Thesis
Country: China; Candidate: M Zhao
GTID: 2480306050964659; Subject: Software engineering
Abstract/Summary:
Single-cell analysis is a new field of bioinformatics. It shifts the study of cellular information from cell populations to individual cells, which helps reveal heterogeneity between cells. Uncovering the unique expression characteristics of individual cells is of great significance for the development of bioinformatics. Cancer is one of the human diseases of greatest concern, and single-cell analysis is an important way to understand the characteristics of cancer cells. This thesis studies clustering algorithms for single-cell gene expression datasets from a variety of cancers. On six single-cell gene expression datasets covering five types of cancer, the data processing is optimized so that the clustering accuracy for cancer cells is higher than that of other clustering algorithms, and the algorithm is stable on single-cell data.

Preprocessing starts from the raw single-cell gene expression data. First, weakly affected genes are defined: genes whose expression values are all zero, genes with non-zero expression in fewer than 5% of cells, and genes whose non-zero expression values have a variance below 5. These genes are filtered out so that downstream analysis can extract information from the data more effectively. The remaining expression values are then transformed by log2(y + 1) to increase the accuracy of downstream analysis.

For feature selection, this thesis designs an algorithm based on window segmentation. On the preprocessed result of each dataset, a fixed-size window is moved with a fixed step to divide the whole dataset into multiple subsets. The M3Drop feature selection algorithm is then applied to each subset. Finally, the feature selection results of all subsets are combined and deduplicated, which gives the feature selection result for the whole dataset.

For cluster analysis, this thesis designs an algorithm based on an ensemble (integrated) approach. On the feature selection result of each dataset, principal component analysis (PCA) and locally linear embedding (LLE) are used to reduce dimensionality and to extract linear and non-linear information, respectively. PCA retains the leading principal components whose cumulative contribution rate reaches 85%, and LLE reduces the data to the same number of dimensions as PCA. Next, Gaussian kernel, polynomial kernel, and hyperbolic tangent kernel spectral clustering are applied to each of the two reduced representations, giving six different clustering results. The ensemble algorithm then builds a consensus matrix from these six results, and K-means is applied to it to obtain the final clustering of the single-cell gene expression data. Finally, the clustering accuracy is calculated by comparison with the cell annotations. Experiments show that the accuracy is higher than that of five other clustering algorithms.

To assess the stability of the whole framework, the gene order of the preprocessed result of each dataset is randomly shuffled to obtain different feature selection results, and the accuracy of the resulting clustering is recalculated. Experiments show that the clustering accuracy varies by no more than 3% across repeated runs, so the algorithm has a certain degree of stability on single-cell gene expression data.
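
The preprocessing and window segmentation steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. All function names, the dict-of-lists data layout, the use of population variance, and the choice to drop a gene when any one of the three criteria holds are assumptions made for illustration. M3Drop is an R package, so the per-window selector is passed in as a hypothetical callable (`select`) rather than implemented here.

```python
import math
from statistics import pvariance


def is_weak_gene(values, min_cell_frac=0.05, min_variance=5.0):
    """Return True if a gene (one expression value per cell) should be filtered.

    Assumed reading of the three criteria: a gene is dropped if ANY holds.
    """
    nonzero = [v for v in values if v != 0]
    if not nonzero:                                  # all expression values are zero
        return True
    if len(nonzero) / len(values) < min_cell_frac:   # non-zero in fewer than 5% of cells
        return True
    # Population variance of the non-zero values (an assumption; the source
    # does not say whether sample or population variance is used).
    return pvariance(nonzero) < min_variance


def preprocess(matrix):
    """matrix maps gene name -> list of raw expression values (one per cell).

    Drops weakly affected genes, then applies log2(y + 1) to what remains.
    """
    return {
        gene: [math.log2(v + 1) for v in values]
        for gene, values in matrix.items()
        if not is_weak_gene(values)
    }


def sliding_windows(genes, size, step):
    """Split an ordered gene list into fixed-size windows moved by a fixed step.

    The last window may be shorter if the list length does not fit exactly.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    start = 0
    while start < len(genes):
        windows.append(genes[start:start + size])
        if start + size >= len(genes):               # this window reached the end
            break
        start += step
    return windows


def windowed_selection(genes, size, step, select):
    """Run `select` on every window, then merge and deduplicate the results.

    `select` is a hypothetical stand-in for a per-subset feature selector
    such as M3Drop; it takes a list of genes and returns the selected ones.
    First-seen order is preserved in the merged result.
    """
    merged = {}
    for window in sliding_windows(genes, size, step):
        for gene in select(window):
            merged.setdefault(gene, None)
    return list(merged)
```

For example, `sliding_windows(list("abcdefg"), 3, 2)` produces three overlapping windows (abc, cde, efg). Because windows depend on gene order, shuffling the genes changes which genes share a window, which is why the stability experiment described above permutes the gene order before feature selection.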
Keywords/Search Tags:single-cell, feature selection, kernel function, integrated algorithm