Research On The Clustering Method Of Cancer Subtypes Based On Genomic Data

Posted on: 2021-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Zhou
GTID: 2370330605961039
Subject: Computer technology
Abstract/Summary:
Cancer is one of the major diseases that threaten human life, and effectively improving cancer detection is of great significance for future cancer research. With the advent and development of high-throughput sequencing technology, genomic data provide new opportunities for cancer discovery and treatment. However, genomic data are often high-dimensional, small in sample size and noisy, so traditional clustering methods are difficult to apply directly to the mining and analysis of cancer genome data. Building on a study of genomic data, this thesis takes graph-theoretic spectral clustering as its main research direction, establishes a spectral clustering model, and improves the existing spectral clustering algorithm. The following aspects are studied.

Since the traditional spectral clustering algorithm cannot describe the spatial relationship between data points well, a density similarity spectral clustering algorithm is proposed, in which a density-based measure replaces the Gaussian kernel function as the similarity measure. When two data points are adjacent, the Hsim measure is used as a weight on the Euclidean distance to avoid an excessive influence on the similarity between data points. When two data points are not adjacent, a density similarity measure is designed that reduces the distance between data points in high-density areas and enlarges the distance between data points in low-density areas, so as to better reflect the real distribution of the data set.

Since the eigendecomposition of the Laplacian matrix of a large data set is excessively costly in time and space, an improved randomized singular value decomposition (SVD) method is proposed to calculate the eigenvectors of the sample submatrix. The Nyström low-rank approximation reduces the computational complexity by sampling the data set and approximating the full matrix, the symmetry of the matrix is used to extract more meaningful points, and the improved SVD method saves computation and improves the efficiency of the algorithm while preserving clustering accuracy.

To verify the density similarity spectral clustering algorithm, clustering experiments were carried out on artificial and real data sets. Analysis of the clustering results shows that the improved algorithm describes the relationship between data points better and improves clustering accuracy. To verify the improved randomized-SVD Nyström spectral clustering algorithm on cancer subtype clustering, it was applied to pancreatic cancer data from the cancer genome project and to a gastric cancer data set from the gene expression database, and the pathological significance of the discovered subtypes was analyzed using survival curves and gene expression heat maps. Finally, it was determined that the improved randomized-SVD Nyström spectral clustering algorithm proposed in this work can be applied to the discovery of cancer subtypes in genomic data.
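As a rough illustration of the pipeline described above (sampling, a randomized SVD of the sample submatrix, Nyström extension, then clustering of the embedding), the following is a minimal NumPy sketch. It is not the thesis's algorithm: the abstract does not give the density similarity measure, the Hsim weighting, or the specific improvements to the randomized SVD, so a plain Gaussian kernel and a textbook randomized SVD are used as stand-ins. All function names, parameters, and defaults (`n_landmarks`, `sigma`, the rank rule, the simple k-means) are assumptions made for this example.

```python
import numpy as np

def gaussian_affinity(A, B, sigma=1.0):
    """Gaussian kernel between rows of A and rows of B (stand-in similarity)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def randomized_svd(M, rank, n_oversample=5, n_iter=2, rng=None):
    """Textbook randomized SVD: random range finder plus a few power iterations."""
    rng = np.random.default_rng(rng)
    k = min(rank + n_oversample, M.shape[1])
    Q = np.linalg.qr(M @ rng.standard_normal((M.shape[1], k)))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(M.T @ Q)[0]
        Q = np.linalg.qr(M @ Q)[0]
    U_small, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

def kmeans(Z, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    return labels

def nystrom_spectral_clustering(X, n_clusters, n_landmarks=40, sigma=1.0, seed=0):
    """Approximate normalized spectral clustering from a landmark sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(n_landmarks, n), replace=False)
    C = gaussian_affinity(X, X[idx], sigma)   # n x m: all points vs. landmarks
    W = C[idx]                                # m x m symmetric sample submatrix
    rank = min(len(idx), max(n_clusters, 10))
    U, s, _ = randomized_svd(W, rank, rng=rng)
    keep = s > 1e-10 * s[0]                   # drop numerically null directions
    F = C @ (U[:, keep] / np.sqrt(s[keep]))   # Nystrom factor: K ~= F @ F.T
    deg = np.maximum(F @ (F.T @ np.ones(n)), 1e-12)   # approximate degrees
    G = F / np.sqrt(deg)[:, None]             # D^-1/2 K D^-1/2 ~= G @ G.T
    V = np.linalg.svd(G, full_matrices=False)[0][:, :n_clusters]
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return kmeans(V, n_clusters)
```

The full n x n affinity matrix is never formed here: only the n x m block against the sampled points is computed, and the expensive decomposition is confined to the m x m submatrix. Replacing `gaussian_affinity` with a density-aware measure would be the natural place to plug in a similarity like the one the abstract describes.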
Keywords/Search Tags:spectral clustering, Cancer subtypes, Genomic data, Similarity measure, Nystrom approximation, Random SVD