
Multi-Source Fusion Based Clustering Analysis For Gene Expression Data

Posted on: 2009-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: J J Zhu
Full Text: PDF
GTID: 2178360272486767
Subject: Computer application technology
Abstract/Summary:
Gene expression data is a quantitative description, obtained with DNA microarray technology, of the expression and regulation of a group of genes. Studying gene expression data and extracting meaningful gene clusters is important for disease diagnosis and for research into the mechanisms behind biological phenomena. However, because the number of samples in gene expression data is far smaller than the number of observed genes, and because noise is introduced during acquisition, cluster analysis based on gene expression data alone often lacks stability and reliability, which in turn reduces the accuracy of prediction.

In this thesis, we study cluster analysis of gene expression data based on multi-source fusion, fusing GO (Gene Ontology), KEGG pathway, and other data to obtain stable, reliable, co-expressed clusters. The main work is as follows:

1. We choose publicly available online yeast genome data as the test data. We use LSA (Latent Semantic Analysis) to reduce the dimensionality of the yeast gene expression data and to remove noise, and we measure similarity between expression profiles with the Euclidean distance. We measure the similarity of GO (Gene Ontology) annotations with a semantic similarity method and compute the values with the Bioconductor software.

2. We use a linear fusion method to fuse gene expression data and Gene Ontology at the level of the similarity measure, and we use the PAM (Partitioning Around Medoids) algorithm to cluster the fused data. The results show that linear fusion greatly improves the effectiveness of the clustering results.

3. Because the linear fusion method cannot determine the fusion coefficient by itself, we propose a novel fusion method: permutation-based fusion. The method sorts the similarity values of gene expression and of GO in descending order, assigns each value a number according to its position, and uses that number as the coefficient when computing the fused data. This method obtains the fusion coefficient automatically and is easier to operate as an algorithm.

4. Because general evaluation methods cannot validate the clustering results of gene expression data in terms of gene function, we propose a method that uses KEGG pathway data to evaluate the effectiveness and significance of the clustering results in terms of biochemical function. When the results are evaluated with the KEGG pathway method, more than half of the clusters can be interpreted.

In this thesis, we use multi-source fusion to cluster gene expression data and obtain better results. However, the fusion strategy is relatively simple, and the role of each data source in clustering lacks a systematic theoretical justification. The next step therefore has two parts. One is to carry out more experiments on different data to validate the effectiveness of the fusion methods; the other is to use information-theoretic methods to study the role that multiple sources play in gene expression data clustering and to provide a theoretical basis for more effective fusion strategies.
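
The two fusion ideas described in the abstract can be illustrated with a minimal Python sketch. It is only one possible reading and not the thesis's actual formulas: the function names, the toy matrices, the weight `alpha`, and the choice to combine the two sources by summing their descending-order rank numbers are all assumptions made for illustration. Both inputs are assumed to be symmetric similarity matrices on a comparable scale.

```python
from itertools import combinations


def linear_fusion(sim_expr, sim_go, alpha):
    """Weighted sum of two similarity matrices; alpha must be chosen by hand."""
    n = len(sim_expr)
    return [[alpha * sim_expr[i][j] + (1 - alpha) * sim_go[i][j]
             for j in range(n)] for i in range(n)]


def descending_ranks(sim):
    """Map each gene pair (i, j), i < j, to its rank; 1 = most similar pair.

    Ties keep the original pair order because sorted() is stable.
    """
    pairs = list(combinations(range(len(sim)), 2))
    ordered = sorted(pairs, key=lambda p: sim[p[0]][p[1]], reverse=True)
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}


def rank_fusion(sim_expr, sim_go):
    """Assumed rank-based fusion: sum the per-source ranks of each gene pair.

    The result behaves like a dissimilarity: lower means more similar.
    No hand-tuned coefficient is needed.
    """
    n = len(sim_expr)
    r_expr = descending_ranks(sim_expr)
    r_go = descending_ranks(sim_go)
    fused = [[0] * n for _ in range(n)]
    for (i, j) in r_expr:
        fused[i][j] = fused[j][i] = r_expr[(i, j)] + r_go[(i, j)]
    return fused


# Toy similarity matrices for three hypothetical genes.
sim_expr = [[1.0, 0.9, 0.2],
            [0.9, 1.0, 0.4],
            [0.2, 0.4, 1.0]]
sim_go = [[1.0, 0.7, 0.1],
          [0.7, 1.0, 0.3],
          [0.1, 0.3, 1.0]]

linear = linear_fusion(sim_expr, sim_go, alpha=0.5)
ranked = rank_fusion(sim_expr, sim_go)
```

In this sketch the linear variant needs `alpha` to be supplied, which is the limitation the abstract points out, while the rank-based variant derives its weights from the ordering of the values. Either fused matrix could then be passed to a medoid-based clustering routine such as PAM, after converting similarities to dissimilarities where the routine expects them.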
Keywords/Search Tags: Clustering, Linear fusion, Permutation-based fusion, Gene expression data, Gene Ontology, KEGG pathway