Research On Robust Matrix Factorization Method And Its Application In Gene Expression Data

Posted on: 2019-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: C M Feng
Full Text: PDF
GTID: 2430330548972665
Subject: Computer Science and Technology
Abstract/Summary:
Cancer (malignant tumors) has become the number one threat to human health. Its main regulatory mechanisms are hidden in gene expression data obtained by gene chip technology and next-generation genome sequencing. Such data are high-dimensional with small sample sizes, and only a few genes, known as characteristic genes, are involved in cancer. Matrix decomposition is an effective way to extract characteristic genes from high-dimensional data. However, as research has deepened, traditional techniques can no longer meet the growing demand. For example: (a) unsupervised matrix decomposition methods suffer from high ambiguity in the training samples; (b) the objective function is based on a squared term, which increases sensitivity to noise and outliers; (c) the principal components (PCs) of PCA are dense, so their biological significance is unclear; (d) traditional linear dimensionality reduction methods cannot capture the geometric structure of nonlinear data. It is therefore difficult to give reasonable biological explanations when these methods are used for feature extraction. In this thesis, we improve the original algorithms to lay a foundation for extracting oncogenes and for cancer prevention, diagnosis, and treatment.

(1) We propose a new method called supervised discriminative sparse PCA (SDSPCA). Its main innovation is to incorporate both discriminative information and sparsity into PCA. Specifically, in contrast to traditional sparse PCA, which imposes sparsity on the loadings, we obtain sparse components that represent the data while also approximating the given label information through a linear transformation. We apply SDSPCA to common characteristic gene (com-characteristic gene) selection and to classification on multi-view biological data. The method is easy to solve and converges quickly. Experimental results demonstrate that SDSPCA outperforms the state-of-the-art methods.

(2) We propose a new robust method called L1/2 constraint graph-Laplacian PCA (L1/2gLPCA). First, manifold learning is introduced to capture the internal geometry of the data. Then, an error function based on the L1/2-norm helps reduce the influence of outliers and noise. The augmented Lagrange multiplier (ALM) method is applied to solve the sub-problem. This method yields better feature extraction results than other state-of-the-art PCA-based methods. Extensive experiments on simulated data and gene expression data sets demonstrate that it achieves higher identification accuracy than the other methods.

(3) We develop a novel PCA method, called PgLPCA, that applies a P-norm to the error function and adds a graph-Laplacian regularization term to the matrix decomposition problem. The core of the method, designed to reduce the effect of outliers and noise, is a new error function based on the non-convex proximal P-norm. In addition, the Laplacian regularization term is used to capture the internal geometric structure in the data representation. P can be chosen freely in the range from 0 to 1, which makes the algorithm flexible, robust, and suitable for a variety of data. The method is used to select characteristic genes and cluster samples from explosively growing biological data, and it achieves higher accuracy than the compared methods.

(4) We propose a new method called sparse graph-Laplacian PCA (gLSPCA). First, we encode the internal geometric structure in the model to improve clustering accuracy. Then, we select PCs with sparse coding. Extensive experimental results demonstrate that gLSPCA is effective in characteristic gene selection and clustering. Moreover, the identified genes provide several new clues for studying the causative factors of cancer.

(5) We propose a new method called dual graph-regularization PCA (DGPCA). This method simultaneously describes the geometric structures of the condition manifold and the gene manifold. We incorporate Laplacian embedding into the PCs to approximate both the data and the cluster indicators. The condition and gene manifolds interact with each other, which helps with bi-clustering and with extracting the "checkerboard" structures in gene expression data. The method has a closed-form solution. Extensive experiments confirm the promising results of DGPCA.
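
As a rough illustration of point (b) and of the motivation behind the L1/2-norm and P-norm error functions in (2) and (3), the sketch below compares how much of the total loss a single outlier accounts for under a squared loss and under an elementwise |r|^p loss with p at most 1. It is not the thesis's objective function: the elementwise form, the function names `elementwise_loss` and `outlier_share`, and the residual values are all assumptions made for illustration.

```python
def elementwise_loss(residuals, p):
    """Sum of |r|^p over all residuals (p = 2 gives the squared error)."""
    return sum(abs(r) ** p for r in residuals)


def outlier_share(residuals, p):
    """Fraction of the total loss contributed by the largest residual."""
    largest = max(abs(r) for r in residuals)
    return largest ** p / elementwise_loss(residuals, p)


# Hypothetical residuals: four small errors and one outlier (5.0).
residuals = [0.1, -0.2, 0.15, -0.1, 5.0]

# A smaller exponent lets the outlier dominate the loss less.
for p in (2.0, 1.0, 0.5):
    print(f"p = {p}: outlier share = {outlier_share(residuals, p):.3f}")
```

With the squared loss the outlier accounts for nearly the entire objective, so a fit is pulled toward it. Lowering the exponent shrinks that share, which is the usual argument for such error functions, at the cost of a non-convex problem when p is below 1.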
Keywords/Search Tags:Principal Component Analysis, Gene expression data, Characteristic selection, Clustering, Classification, Bi-clustering, Manifold learning