The development of high-throughput biotechnology and data acquisition technology has produced large volumes of biological data, such as gene expression data and gene interaction networks. Gene expression data analysis has been widely used in cancer subtype categorization, gene therapy, drug research and other fields. Cancers can be classified into subtypes based on the pervasive differences in their gene expression patterns, which makes it possible to provide cancer patients with precise treatment and contributes to drug research and pathological analysis. Clustering is one of the most effective techniques for analyzing gene expression data, so using clustering to improve the efficiency and accuracy of cancer subtype discovery is of great significance.

Gene expression data are characterized by high dimensionality, high noise and small sample sizes, and similar genes (or samples) may exhibit similar behaviors only over a subset of samples (or genes). Traditional clustering, however, groups gene expression data along only one dimension, samples or genes, and generally relies on the global feature information of the samples, so it has many deficiencies in discovering cancer subtypes from gene expression data. Co-clustering (or bi-clustering) simultaneously groups the genes and samples of an input gene expression data matrix to discover co-clusters in which relevant samples exhibit similar gene expression profiles over a subset of genes, and it is receiving increasing attention in gene expression data analysis. However, most existing co-clustering algorithms either discover only one type of co-cluster or explore various types of co-clusters by greedy search with low efficiency. Gene interaction networks can help to understand the patterns of cancer subtypes, but they are rarely integrated into the co-clustering process for exploring cancer subtypes. In addition, clustering ensembles can improve accuracy and robustness. A co-clustering ensemble is more challenging than a traditional clustering ensemble in terms of optimization method and time complexity, since it needs to obtain the final co-clustering solution along two dimensions simultaneously. Existing co-clustering ensemble algorithms have difficulty handling large-scale data and cannot make full use of the structure of the obtained base co-clusters.

This thesis aims to address the above problems in clustering cancer gene expression data and to improve the accuracy and efficiency of clustering gene expression data for identifying cancer subtypes. The main work of this thesis is as follows:

(1) We propose a network-aided co-clustering algorithm based on matrix factorization (NetBC). NetBC first uses GeneRank to assign weights to genes according to the deviation of their expression values and the gene interaction network. Then, the weight matrix is combined with the sum-squared residuals objective based on matrix tri-factorization. Finally, the indicator matrices of rows and columns are optimized iteratively to obtain the final co-clusters. Experiments on several real cancer gene expression datasets show the validity and superiority of NetBC in categorizing cancer subtypes. In experiments with injected simulated noise, NetBC is more robust to noise than other related methods. In addition, on datasets containing simulated co-clusters of different types, NetBC effectively discovers more types of co-clusters.

(2) To effectively integrate multiple base co-clustering solutions, we propose a co-clustering ensemble (CoCE) approach based on a hybrid graph. CoCE first produces multiple base co-clustering solutions by repeatedly running different base co-clustering algorithms, or one co-clustering algorithm with different initializations. Then, CoCE evaluates the quality of the discovered co-clusters and, from this, measures feature-to-object relevance. This relevance, together with feature-to-feature and object-to-object similarities, defines a hybrid graph. The consensus process operates on the resulting hybrid graph; it is formulated as a trace minimization problem and solved with a block-wise matrix multiplication technique. Experimental results on various datasets show that CoCE not only frequently outperforms other related co-clustering ensembles, but also has a lower runtime cost and is more robust to poor base co-clusterings.
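To make the idea behind contribution (1) more concrete, the following is a minimal sketch of gene-weighted co-clustering with a sum-squared residuals objective. It is not the NetBC algorithm itself: the abstract does not give the update rules, so a simple hard-assignment alternating scheme (a block k-means style procedure) is swapped in as an assumption. The function names `block_means`, `weighted_residual` and `weighted_coclustering` are hypothetical, and the weight vector `w` merely stands in for gene weights such as those GeneRank would produce.

```python
import numpy as np


def block_means(X, w, rows, cols, k, l):
    """Weighted mean of every (gene cluster, sample cluster) block."""
    R = np.eye(k)[rows] * w[:, None]   # gene indicators scaled by gene weight
    C = np.eye(l)[cols]                # sample indicators
    num = R.T @ X @ C
    den = np.outer(R.sum(axis=0), C.sum(axis=0))
    # Empty (or zero-weight) blocks get mean 0 instead of dividing by zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def weighted_residual(X, w, rows, cols, S):
    """Gene-weighted sum of squared residuals between X and its block model."""
    return float((w[:, None] * (X - S[rows][:, cols]) ** 2).sum())


def weighted_coclustering(X, w, k, l, n_iter=50, seed=0):
    """Alternate gene and sample reassignments (assumed scheme, not NetBC).

    X: genes x samples matrix, w: nonnegative gene weights,
    k / l: number of gene / sample clusters.
    """
    rng = np.random.default_rng(seed)
    rows = rng.integers(k, size=X.shape[0])
    cols = rng.integers(l, size=X.shape[1])
    S = block_means(X, w, rows, cols, k, l)
    history = [weighted_residual(X, w, rows, cols, S)]
    for _ in range(n_iter):
        # Gene step: assign each gene to the cluster whose block means fit best.
        row_cost = ((X[:, None, :] - S[:, cols][None, :, :]) ** 2).sum(axis=2)
        rows = row_cost.argmin(axis=1)
        S = block_means(X, w, rows, cols, k, l)
        # Sample step: same idea, but residuals are weighted by gene weight.
        diff = X[:, :, None] - S[rows][:, None, :]
        col_cost = (w[:, None, None] * diff ** 2).sum(axis=0)
        cols = col_cost.argmin(axis=1)
        S = block_means(X, w, rows, cols, k, l)
        history.append(weighted_residual(X, w, rows, cols, S))
        if history[-2] - history[-1] < 1e-12:   # no further improvement
            break
    return rows, cols, history
```

A call such as `rows, cols, history = weighted_coclustering(X, w, k=3, l=2)` returns one cluster label per gene and per sample, plus the objective value after each iteration. Each step can only keep or lower the weighted residual, so `history` is non-increasing, but the result is a local optimum that depends on the random initialization.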