Analysis Of Gene Expression Data Clustering

Posted on: 2008-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Yi
Full Text: PDF
GTID: 2120360272477024
Subject: Biomedical engineering
Abstract/Summary:
With the development of microarray technology, more and more gene expression datasets are becoming available. How to extract useful information from these datasets has therefore become an important issue in bioinformatics research.

Genes with similar functions usually share similar expression patterns, so the function of an unknown gene can be predicted by analyzing genes with similar expression patterns. Clustering is a data-mining method that partitions data into clusters according to similarity, so that similar items are grouped together. With a clustering algorithm, genes with similar expression can be placed in the same group, which helps in discovering gene functions and the relationships between genes.

However, clustering is a subjective process: different choices of algorithm, number of clusters, or starting seeds lead to different outcomes, which makes the results of gene expression data clustering subjective as well. The key problem in gene expression data analysis is now how to use the existing algorithms effectively and make clustering more objective, which would improve the accuracy of the analysis.

To address these problems, this thesis covers the following work:

(1) Gene expression data contain many missing values, for various reasons, and these affect the accuracy of clustering. A General Regression Neural Network was employed in this thesis to estimate the missing values.

(2) Different clustering algorithms for gene expression data were studied, some advanced algorithms were introduced, and the relationship between clustering algorithms and the distribution structure of the data was investigated.

(3) Gene expression data with different distribution structures should be clustered by different algorithms, but the distribution structure of high-dimensional gene expression data is difficult to obtain. In this thesis, the stability of the clustering results was taken as the evaluation criterion, and a stability-based method was proposed for selecting clustering algorithms.

(4) When the same algorithm is applied to the same dataset, the clustering results may vary from run to run because the starting seeds differ each time. The seed setting affects both the probability of falling into a local minimum and the number of iterations needed. In this thesis, a PCA-based method was proposed for setting the seeds in gene expression data clustering.
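
As an illustration of item (1), the following is a minimal sketch of how a General Regression Neural Network could estimate missing values. It is not the thesis's implementation, which the abstract does not describe. The sketch assumes a genes x conditions matrix with missing entries stored as NaN, uses only genes without missing values as training patterns, and takes a single smoothing parameter `sigma`; the function name `grnn_impute` and all parameter names are hypothetical. A GRNN output is a Gaussian-kernel-weighted average of the training targets, which is what the code computes.

```python
import numpy as np

def grnn_impute(X, sigma=1.0):
    """Fill NaNs in a genes x conditions matrix with GRNN estimates.

    Assumption: genes (rows) without any missing value serve as the
    training patterns; sigma is the GRNN smoothing parameter.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    has_nan = np.isnan(X).any(axis=1)
    complete = X[~has_nan]  # training patterns
    if complete.shape[0] == 0:
        raise ValueError("need at least one gene without missing values")

    for i in np.flatnonzero(has_nan):
        missing = np.isnan(X[i])
        observed = ~missing
        # Squared Euclidean distance to every training gene,
        # measured over the conditions this gene actually has.
        d2 = ((complete[:, observed] - X[i, observed]) ** 2).sum(axis=1)
        # Pattern layer: Gaussian activations. Subtracting d2.min()
        # rescales all weights by the same factor, so the ratio below
        # is unchanged but underflow to 0/0 is avoided.
        w = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))
        # Summation and output layers: weighted average of targets.
        filled[i, missing] = w @ complete[:, missing] / w.sum()
    return filled

# Toy example: the last gene resembles the first two, so its estimate
# is dominated by their values in the third condition.
X_demo = np.array([[1.0, 2.0, 3.0],
                   [1.1, 2.1, 3.1],
                   [5.0, 6.0, 7.0],
                   [1.0, 2.0, np.nan]])
print(grnn_impute(X_demo, sigma=0.5))
```

A small `sigma` makes the estimate follow the nearest complete genes closely, while a large `sigma` pulls it toward the average of all complete genes; in practice `sigma` would have to be tuned, for example by hiding known values and measuring the reconstruction error.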
Keywords/Search Tags: gene expression data, clustering, missing value, algorithm selection, seeds