
The Research On Clustering Algorithm Applied To Gene Expression Data

Posted on: 2014-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wang
GTID: 2268330398499494
Subject: Computer software and theory
Abstract/Summary:
Gene chip experiments can now produce thousands of gene expression measurements. These data carry rich information that can explain the phenomena of life, and analyzing them helps us understand how genetic information is converted into functional gene products. Clustering, an important class of analysis methods, is widely used to extract biological information from gene expression data.

The basic principle of a clustering algorithm is to divide a set of variables into classes according to a similarity measure. Conventional clustering algorithms cluster genes or conditions separately. They assume that related genes behave similarly under all conditions, so they can capture only the global structure of the gene expression data. Because high-dimensional gene expression data contain many local patterns, coclustering algorithms have been proposed as a powerful computational tool for detecting subsets of genes that exhibit a consistent pattern over subsets of conditions. (A cocluster of a gene expression dataset is a subset of genes that exhibit similar expression patterns along a subset of conditions.) Despite much research in this area, existing coclustering algorithms have critical limitations, both in identifying coclusters with different types of correlation and in capturing overlapping coclusters in the data matrix. In this work, we compare and analyze several coclustering algorithms and then present a new coclustering algorithm that is combined with a clustering algorithm.
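The difference between a global and a local pattern can be illustrated with a coherence score for a submatrix. The following is a minimal sketch, assuming the mean squared residue of Cheng and Church as the score; the abstract does not say which measure the thesis uses at this point, and the toy matrix and the function name `mean_squared_residue` are hypothetical.

```python
import numpy as np


def mean_squared_residue(data, rows, cols):
    """Mean squared residue of data[rows, cols]; lower means more coherent."""
    sub = data[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall_mean = sub.mean()
    # Residue: what is left after removing the row, column, and overall effects.
    residue = sub - row_means - col_means + overall_mean
    return float(np.mean(residue ** 2))


# Hypothetical toy data: rows are genes, columns are conditions.
# Genes 0-2 follow the same shifted pattern under conditions 0-2 only.
expression = np.array([
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
])

local_score = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
global_score = mean_squared_residue(expression, [0, 1, 2, 3], [0, 1, 2, 3])
print(local_score, global_score)
```

The local submatrix scores essentially zero while the full matrix does not, which is the kind of pattern a method that compares genes across all conditions would miss. Note that this particular score rewards shifted (additive) patterns only; it does not by itself capture negatively correlated genes.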
We evaluated our algorithm on several real-world gene expression datasets. The experimental results showed that the proposed algorithm is able to find biologically significant coclusters and that it outperformed some well-known existing coclustering algorithms in terms of the quality, size, and biological significance of the coclusters.

The main contributions of this work are the following:

(1) Based on ideas from lossy data coding and compression, we present a simple but effective technique for clustering genes. The goal is to find the optimal segmentation, that is, the one that minimizes the overall coding length. The advantage of this algorithm is that it determines the number of clusters automatically.

(2) After analyzing the advantages and disadvantages of currently popular coclustering algorithms, we combine a coclustering algorithm with the clustering algorithm based on lossy data compression. Our algorithm uses a novel ranking-based objective function that is optimized to simultaneously find large coclusters with minimum residual error. It allows positively and negatively correlated objects to be members of the same cocluster and can extract overlapping coclusters. In addition, the coclusters can be arbitrarily positioned in the data matrix.
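The coding-length criterion in contribution (1) can be sketched as follows. This is only an illustration under stated assumptions: it uses a coding-length formula common in the lossy-compression segmentation literature (a Gaussian rate-distortion estimate plus a cost for the group mean and for group membership), which may differ from the exact formulation in the thesis. The distortion `eps`, the toy data, and the function names are hypothetical.

```python
import numpy as np


def coding_length(W, eps):
    """Approximate bits needed to code the columns of W up to distortion eps.

    W has shape (n, m): n features (conditions) by m samples (genes).
    """
    n, m = W.shape
    mu = W.mean(axis=1, keepdims=True)
    Wc = W - mu
    gram = np.eye(n) + (n / (eps ** 2 * m)) * (Wc @ Wc.T)
    _, logdet = np.linalg.slogdet(gram)  # natural log of the determinant
    bits_for_spread = 0.5 * (m + n) * logdet / np.log(2)
    bits_for_mean = 0.5 * n * np.log2(1.0 + float(np.sum(mu ** 2)) / eps ** 2)
    return bits_for_spread + bits_for_mean


def segmentation_length(W, labels, eps):
    """Total bits for a segmentation: each group's code plus membership bits."""
    m = W.shape[1]
    total = 0.0
    for k in np.unique(labels):
        group = W[:, labels == k]
        m_k = group.shape[1]
        total += coding_length(group, eps) + m_k * (-np.log2(m_k / m))
    return total


# Hypothetical toy data: two tight, well-separated groups of 20 genes each.
rng = np.random.default_rng(0)
group_a = 0.1 * rng.standard_normal((2, 20))
group_b = 10.0 + 0.1 * rng.standard_normal((2, 20))
W = np.hstack([group_a, group_b])

one_cluster = np.zeros(40, dtype=int)
two_clusters = np.array([0] * 20 + [1] * 20)

print(segmentation_length(W, one_cluster, eps=0.5))
print(segmentation_length(W, two_clusters, eps=0.5))
```

On data like this, splitting into two groups gives a shorter total code than keeping one group. A search that merges or splits groups only when the total length decreases therefore settles on a number of clusters without being told it in advance, which is the property the abstract highlights.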
Keywords/Search Tags: gene expression data, clustering, coclustering, lossy data compression, positive and negative correlation