Clustering Algorithm And Disease Association Research For Genetic Data

Posted on: 2024-04-27 | Degree: Master | Type: Thesis
Country: China | Candidate: T Wang | Full Text: PDF
GTID: 2544307142481774 | Subject: Software engineering
Abstract/Summary:
The move from bulk transcriptome sequencing to single-cell transcriptome sequencing (scRNA-seq) has given researchers a new way to study gene expression and transcription at the level of individual cells. Clustering methods developed for bulk data are no longer applicable to single-cell data, so building clustering algorithms suited to the characteristics of single-cell data is an active problem in this field. However, small sample sizes, stochastic gene expression, low cell-capture rates, and low sequencing efficiency introduce a large amount of technical and biological noise, so scRNA-seq data are high-dimensional, noisy, high-throughput, and subject to batch effects. Starting from these characteristics of single-cell gene expression data, this thesis studies clustering algorithms with the aim of improving clustering performance. A two-stage clustering model based on manifold learning and an autoencoder (the UASK clustering model for short) is designed, and the association between genes and disease is studied. The specific work is as follows:

1. To address the high dimensionality of scRNA-seq data, a gene dimension reduction model based on Uniform Manifold Approximation and Projection (UMAP) is designed. The model builds a low-dimensional representation of the high-dimensional feature data and achieves effective dimension reduction. Comparison with other dimensionality reduction methods shows that clustering built on the UMAP dimension reduction framework performs better.

2. To address the high noise, sparsity, and large number of missing values in scRNA-seq data, a missing-value imputation model for gene data based on a neural network autoencoder is designed, which mitigates the problems that the large number of "false zero" expression values cause for downstream analysis. To address the slow speed and low accuracy of clustering on high-throughput gene data, a two-stage clustering algorithm that combines a self-organizing map (SOM) with k-means is introduced into the clustering module, drawing on the speed of the SOM neural network and the accuracy of k-means to improve clustering performance on gene data. Together these components form a complete two-stage clustering model for genetic data based on manifold learning and an autoencoder (the UASK model for short). Compared with the established clustering methods SIMLR, scGMAI, CIDR, and Seurat, UASK achieves better ARI and NMI scores.

3. Cluster analysis of genetic data is widely used to study the association between genes and diseases. A gene-disease association study was therefore carried out: the association between polymorphisms of the TCF7L2 gene and type 2 diabetes mellitus was systematically evaluated by meta-analysis. The analysis concluded that the polymorphism at the rs7903146 site of TCF7L2 is strongly associated with type 2 diabetes mellitus, which illustrates the value of cluster analysis of genetic data.

Through this work, the study of clustering algorithms and disease association for genetic data is completed: a clustering algorithm suited to the characteristics of single-cell data is constructed, and the low quality, poor clustering accuracy, and low efficiency of existing clustering algorithms on single-cell data are addressed. The research has theoretical significance and practical value for understanding problems in medicine, such as the genome and the origin, development, and treatment of diseases.
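The abstract does not give implementation details for the SOM and k-means stage, so the following is only a minimal sketch of one common reading of such a two-stage scheme: a small SOM first compresses the cells into a handful of prototype vectors, k-means then groups the prototypes, and each cell inherits the label of its best-matching prototype. The function names, the 1-D SOM layout, the decay schedules, and the synthetic two-blob data are all assumptions made for illustration and are not taken from the thesis. An Adjusted Rand Index (ARI) helper is included because the abstract reports ARI scores.

```python
import math
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, points):
    """Index of the point in `points` closest to x."""
    return min(range(len(points)), key=lambda j: dist2(x, points[j]))


def train_som(data, n_units=6, epochs=20, lr0=0.5, seed=0):
    """Stage 1: fit a 1-D SOM (a chain of units) and return its prototypes."""
    rng = random.Random(seed)
    # Assumption: initialise prototypes from evenly spaced input rows.
    protos = [list(data[i * (len(data) - 1) // (n_units - 1)])
              for i in range(n_units)]
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                         # decaying learning rate
            sigma = max(0.5, (n_units / 2) * (1.0 - frac))  # shrinking neighbourhood
            bmu = nearest(x, protos)                        # best-matching unit
            for j in range(n_units):
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                protos[j] = [p + lr * h * (xi - p)
                             for p, xi in zip(protos[j], x)]
            step += 1
    return protos


def kmeans(points, k, iters=50):
    """Stage 2: Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_stage_cluster(data, k, n_units=6):
    """Cluster SOM prototypes with k-means, then label cells via their BMU."""
    protos = train_som(data, n_units=n_units)
    proto_labels = kmeans(protos, k)
    return [proto_labels[nearest(x, protos)] for x in data]


def _pairs(counts):
    """Number of unordered pairs that can be drawn within each group."""
    return sum(math.comb(c, 2) for c in counts)


def adjusted_rand_index(truth, pred):
    """ARI between two labelings of the same items."""
    both = _pairs(Counter(zip(truth, pred)).values())
    a = _pairs(Counter(truth).values())
    b = _pairs(Counter(pred).values())
    expected = a * b / math.comb(len(truth), 2)
    max_index = (a + b) / 2
    if max_index == expected:  # both labelings are trivial
        return 1.0
    return (both - expected) / (max_index - expected)


# Synthetic stand-in for reduced, imputed expression data (NOT real scRNA-seq).
data_rng = random.Random(1)
cells = [[data_rng.gauss(c, 0.5), data_rng.gauss(c, 0.5)]
         for c in (0.0, 10.0) for _ in range(20)]
truth = [0] * 20 + [1] * 20
labels = two_stage_cluster(cells, k=2)
print("ARI:", adjusted_rand_index(truth, labels))
```

In this arrangement the SOM does the expensive pass over all cells, while k-means only has to handle the few prototypes, which is one way the speed and accuracy trade-off described above can be realised. In practice the input would be the UMAP-reduced, autoencoder-imputed matrix rather than synthetic points.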
Keywords/Search Tags: Gene Expression Data, Clustering, Dimension Reduction, Neural Networks, Autoencoder