Research On Low-rank Representation Methods For Cancer Gene Expression Data Mining

Posted on: 2021-03-20 | Degree: Master | Type: Thesis
Country: China | Candidate: X X Xu
GTID: 2434330605460332 | Subject: Computer application technology
Abstract/Summary:
Cancer is a leading threat to human life. With the development of second-generation sequencing technology, a large amount of gene expression data has become available. These data contain abundant gene expression information, which gives researchers the data support needed to reveal the pathogenesis of cancer at the molecular level. However, such data are usually high-dimensional, small-sample, and strongly noisy, which poses a great challenge for cancer data mining. Low-rank representation (LRR) is a matrix decomposition method that can reduce the dimension of data and lessen the impact of noise, and it has achieved considerable success in the cancer data mining field. On the basis of a survey of the related domestic and foreign literature, the author proposes three new LRR methods that address shortcomings of existing LRR methods, and applies them to cancer gene expression data mining, with the aim of studying the internal mechanism of cancer lesions and analyzing cancer subtypes more accurately. The specific research contents are as follows:

(1) A new LRR method regularized by the truncated nuclear norm and a graph-Laplacian is proposed. The singular values of the low-rank matrix that LRR decomposes from the observed data matrix form a rapidly decreasing sequence. The nuclear norm, which minimizes all singular values together, is therefore not the best approximation of the matrix rank function. The new algorithm uses the truncated nuclear norm in place of the nuclear norm when relaxing the rank-minimization problem, which retains the information carried by the principal components of the matrix, effectively reduces the damage caused by shrinking the singular values, and approximates the rank of the matrix more accurately. In addition, the new method introduces a graph-Laplacian term that captures the intrinsic geometric structure and the similarity information in the data. The results of cancer gene expression data mining experiments show that the improved algorithm is more robust to noise and outliers.

(2) A new LRR method regularized by two hypergraph-Laplacian terms is proposed. Existing LRR methods capture the intrinsic geometric structure hidden in the data space by imposing a graph-Laplacian constraint on the low-rank matrix. However, the graph-Laplacian cannot find the co-expression information in gene expression data. To remedy this defect, the new method introduces two hypergraph-Laplacian terms and imposes them on the low-rank matrix and the sparse matrix respectively, in order to extract the intrinsic geometric structure in both the sample space and the gene space of cancer data. The results of cancer gene expression data mining experiments show that these changes improve the method's ability to encode the structure of the data space.

(3) A new latent LRR method regularized by the truncated nuclear norm and a graph-Laplacian is proposed. Most LRR methods use the original data matrix, which has few samples and high noise, as the dictionary matrix, and this is not the best choice. The new algorithm decomposes the original data matrix into two feature matrices and a sparse matrix. One of the two feature matrices is used for cancer sample clustering and the other is used for identifying differentially expressed genes, which handles cancer data with insufficient samples and noise better. In addition, this method takes integrated cancer genome data as its study object in order to explore the internal relationships among multiple cancers, which effectively addresses the problem of sample imbalance.
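
The two regularizers named in the abstract, the truncated nuclear norm and the graph-Laplacian term, can be illustrated with a minimal NumPy sketch. This is not the thesis's algorithm or its solver. The function names, the truncation parameter `r`, the binary k-nearest-neighbor graph, and the penalty form tr(Z L Z^T) are all assumptions chosen for illustration, following the definitions commonly used in the LRR literature.

```python
import numpy as np


def nuclear_norm(X):
    """Sum of all singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())


def truncated_nuclear_norm(X, r):
    """Sum of the singular values of X after skipping the r largest.

    `r` is an illustrative truncation parameter (an assumption here):
    the r leading singular values are left unpenalized.
    """
    s = np.linalg.svd(X, compute_uv=False)  # sorted in descending order
    return float(s[r:].sum())


def knn_graph_laplacian(samples, k=3):
    """Unnormalized Laplacian L = D - W of a symmetric binary k-NN graph.

    `samples` holds one sample per row. Binary weights are an assumption;
    heat-kernel weights are another common choice.
    """
    n = samples.shape[0]
    diff = samples[:, None, :] - samples[None, :, :]
    d2 = (diff ** 2).sum(axis=2)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i][:k]
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # symmetrize the adjacency matrix
    return np.diag(W.sum(axis=1)) - W


def laplacian_penalty(Z, L):
    """Graph regularizer tr(Z L Z^T); column i of Z represents sample i."""
    return float(np.trace(Z @ L @ Z.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 200 "genes" x 20 "samples", rank 3 plus small noise.
    X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 20))
    X_noisy = X + 0.01 * rng.standard_normal(X.shape)
    print("nuclear norm:          ", nuclear_norm(X_noisy))
    print("truncated norm (r = 3):", truncated_nuclear_norm(X_noisy, 3))
    L = knn_graph_laplacian(X_noisy.T, k=3)
    Z = rng.standard_normal((20, 20))
    print("graph penalty:         ", laplacian_penalty(Z, L))
```

For a matrix of exact rank `r`, the truncated norm is numerically zero while the nuclear norm is not, which is the sense in which truncation tracks the rank more closely. The graph penalty equals half the weighted sum of squared distances between the representations of neighboring samples, so it is small when neighbors in the data receive similar representations.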
Keywords/Search Tags:Gene expression data, Truncated nuclear norm, Hypergraph-laplacian regularization, Differentially expressed genes, Cancer sample clustering