
Clustering Algorithm Based On Biological Knowledge And Its Application On Gene Expression Data

Posted on: 2011-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Sun
GTID: 2178360305455246
Subject: Bioinformatics
Abstract/Summary:
With the widespread use of gene chip technology, large amounts of numerical tabular data have been generated. Meanwhile, life science has entered the post-genome era, in which the research focus has shifted from individual genes to the functions and dynamics of the whole genome. How to analyze these data reasonably and effectively and extract valuable information has become a key issue in this field. In recent years, it has been shown that genes with the same function, or involved in the same biological process, are likely to be co-expressed; hence clustering of gene expression data is the main technique for gene function prediction. Cluster analysis can group genes into classes according to their functions and thereby support comprehensive research on gene function and gene regulation.
Gene expression data suffer severely from the curse of dimensionality, measurement noise, and high correlation between genes, and they place high demands on data processing. Most existing clustering methods ignore known gene functions during clustering, and their results lack stability. These traditional clustering algorithms are also very sensitive to initialization, and their results lack biological interpretability. Given these properties of gene expression data, it is recognized that traditional clustering methods cannot meet the demands of the analysis, and it has been shown that incorporating biological knowledge into statistical analysis is a reliable way to maximize statistical efficiency. Cluster analysis has become an important step in analyzing gene expression data, but how to further analyze clustering results in terms of higher-level biological knowledge remains a problem in functional genomics research.
This thesis mainly studies clustering algorithms for gene expression data.
In the study of gene expression data, the commonly used clustering algorithms are hierarchical clustering, K-means clustering, and SOM (self-organizing map) clustering. In this thesis, we propose incorporating biological knowledge into clustering algorithms, which makes them more suitable for expression data and makes full use of the accumulating gene function annotations. After a careful summary of related work, this thesis briefly introduces the gene chip technology behind gene expression data and then reviews the main statistical methods applied to such data, such as partition-based and hierarchy-based clustering algorithms. Every method has its own advantages, but each also has limitations. We focus on how to exploit the advantages of these methods and design new algorithms suited to gene expression data.
First, we propose a double K-medoids clustering method based on known biological knowledge. We use a given radius and a modified distance metric to control the distribution of data points. Instead of simply treating all genes independently, we group the genes according to their biological functions extracted from existing biological databases. The key idea is to use known gene functions to generate the initial cluster centers and thus obtain the clustering structure. To some extent, this gives the distribution of data points more biological significance: if two genes share a common function, they are more likely to fall in the same region delimited by the given radius. A simulation study and an application to a real dataset demonstrate the advantage of our proposal over the standard method: it yields more stable and reliable results, especially in discovering gene sets with completely unknown function.
The double K-medoids clustering method has its own limitations. It needs the number of clusters as an input, and it has difficulty handling noisy data.
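The idea of seeding clusters from known gene functions can be illustrated with a minimal sketch. This is not the thesis's double K-medoids algorithm: the abstract does not specify the radius constraint or the modified distance metric, so the sketch uses plain Euclidean distance and ordinary k-medoids updates. The function names (`seeded_k_medoids`, `medoid`) and the `annotations` mapping are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid(indices, data):
    """Index (from `indices`) with the smallest total distance to the others."""
    return min(indices,
               key=lambda i: sum(euclidean(data[i], data[j]) for j in indices))


def seeded_k_medoids(data, annotations, max_iter=100):
    """k-medoids whose initial medoids come from known function groups.

    data: list of expression profiles (lists of floats), one per gene.
    annotations: hypothetical dict {gene index: function label} covering
        only the annotated genes; each distinct label seeds one cluster.
    Returns (medoids, labels); labels[i] is the cluster position of gene i.
    """
    # Group annotated genes by function label.
    groups = {}
    for gene, function in annotations.items():
        groups.setdefault(function, []).append(gene)
    # One initial medoid per function group (sorted for a stable order).
    medoids = [medoid(members, data) for _, members in sorted(groups.items())]

    labels = []
    for _ in range(max_iter):
        # Assignment step: each gene joins its nearest medoid.
        labels = [min(range(len(medoids)),
                      key=lambda c: euclidean(profile, data[medoids[c]]))
                  for profile in data]
        # Update step: recompute the medoid of every cluster.
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, label in enumerate(labels) if label == c]
            new_medoids.append(medoid(members, data) if members else medoids[c])
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, labels


# Toy data: two well-separated groups, one annotated gene in each.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
known = {0: "function A", 3: "function B"}
medoids, labels = seeded_k_medoids(profiles, known)
```

Because the starting medoids are fixed by the annotations rather than drawn at random, repeated runs on the same input give the same partition, which is one simple way to see why knowledge-based seeding can reduce sensitivity to initialization.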
Moreover, double K-medoids cannot make full use of known gene functions. To overcome these drawbacks, we propose the K-density clustering method, which brings density information about genes into traditional clustering algorithms. We use known gene functions to identify tight kernels, and then several procedures expand the kernels into full clusters. After this step, some genes of unknown function are included in the expanded kernels, and the remaining genes form new clusters, which may correspond to new functional categories. This flexibility is desirable. In addition, the K-density algorithm uses density information from known genes to find local core points, which overcomes the drawback of setting a global core point in traditional methods.
Finally, we apply the K-density method to two real data sets. Compared with traditional clustering algorithms, our proposal performs better than the standard method. It achieves higher accuracy, especially in identifying genes with completely unknown function. Nevertheless, current research seems to be mainly restricted to using biological knowledge as an evaluation criterion for validating analysis results. In our methods, we have assumed that genes can be partitioned into several groups based on biological knowledge, such as gene functional annotations or metabolic networks. However, it is unclear how to incorporate other types of knowledge (GO, pathways) into clustering methods. This needs further discussion and is an interesting topic for future study.
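The kernel-expansion idea (known functions define kernels, nearby unannotated genes are absorbed, and leftover genes form new clusters) can be sketched roughly as follows. This is a simplified illustration, not the K-density algorithm itself: the abstract does not give the density estimate or the expansion procedures, so the sketch substitutes a single fixed `radius` around each kernel centroid and links leftover genes by the same radius. All names here (`expand_kernels`, `radius`, the `"new-<n>"` labels) are assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of profiles."""
    return [sum(column) / len(points) for column in zip(*points)]


def expand_kernels(data, annotations, radius):
    """Assign genes to function kernels, then group leftovers into new clusters.

    data: list of expression profiles, one per gene.
    annotations: hypothetical dict {gene index: function label}.
    radius: assumed fixed distance threshold (stands in for density information).
    Returns one label per gene: a function label, or "new-<n>" for a new cluster.
    """
    # Build one kernel per known function from its annotated genes.
    kernels = {}
    for gene, function in annotations.items():
        kernels.setdefault(function, []).append(data[gene])
    centers = {function: centroid(points) for function, points in kernels.items()}

    labels = [annotations.get(i) for i in range(len(data))]

    # Expansion: an unannotated gene joins its nearest kernel if within radius.
    for i, profile in enumerate(data):
        if labels[i] is None and centers:
            nearest = min(centers, key=lambda f: dist(profile, centers[f]))
            if dist(profile, centers[nearest]) <= radius:
                labels[i] = nearest

    # Leftovers: genes within `radius` of each other are linked into new clusters.
    new_id = 0
    for i in range(len(data)):
        if labels[i] is None:
            labels[i] = f"new-{new_id}"
            stack = [i]
            while stack:
                current = stack.pop()
                for j in range(len(data)):
                    if labels[j] is None and dist(data[current], data[j]) <= radius:
                        labels[j] = labels[i]
                        stack.append(j)
            new_id += 1
    return labels


# Toy data: two annotated genes, one nearby unknown gene, three distant ones.
profiles = [[0, 0], [0, 1], [0.5, 0.5], [10, 10], [10, 11], [20, 20]]
known = {0: "A", 1: "A"}
result = expand_kernels(profiles, known, radius=1.5)
```

In this toy run the unannotated gene near kernel "A" is absorbed into it, while the distant genes end up in clusters of their own, mirroring the possibility that leftover genes correspond to new functional categories. Unlike plain k-medoids, the number of clusters is not fixed in advance.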
Keywords/Search Tags: Gene Expression Data, Clustering, Gene Function, Biological Knowledge