Research And Application Of Spectral Clustering In Analysis Of Gene Expression Data

Posted on: 2011-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Deng
Full Text: PDF
GTID: 2120360308958946
Subject: Computer application technology
Abstract/Summary:
Gene chip technology has developed rapidly and is widely applied in biology, but it generates large volumes of gene expression data. How to analyze these massive data sets has become a new problem for molecular biologists, and bioinformatics, a rapidly emerging discipline, has become a frontier area of research. Gene expression data, obtained from microarray experiments, reflect the abundance of mRNA produced by transcription in cells. By analyzing these data, we can obtain information about gene function and regulation. Research on gene expression data has become an active interdisciplinary field spanning the life sciences, mathematics, and computer science, and one of the hot topics in bioinformatics.

Clustering is an important method for analyzing massive data. Clustering groups genes with similar expression patterns into the same cluster, so the function of an unknown gene can be inferred from the known functions of the genes in its cluster.

This thesis studies clustering for the analysis of gene expression data. The main work is as follows:

① The cluster analysis algorithms usually applied to gene expression data depend too heavily on the shape of the data distribution, and their results converge to local optima. This thesis therefore uses spectral clustering to analyze gene expression data. Spectral clustering is a novel algorithm based on the eigenvectors of a matrix derived from the data; it can also be viewed as an algorithm that partitions a graph according to the weights between its vertices. The algorithm does not depend on the shape of the data distribution, and it can converge to a global optimum.

② Because spectral clustering cannot automatically determine the best number of clusters, it must compute eigenvalues and eigenvectors iteratively, which is time-consuming. In this thesis we design a method called VP that finds the number of clusters automatically within the spectral clustering algorithm. This method reduces the time complexity, so it is important for the analysis of large gene expression data sets.

③ Given the high dimensionality and small sample size of gene expression data, and drawing on biological knowledge, we propose raising the weight of certain samples to obtain more accurate clustering results.

④ In keeping with the purpose of cluster analysis of gene expression data, we propose a method called ARI to calculate the accuracy of a clustering result. We then adopt ARI as an external criterion and the classical adjusted FOM as an internal criterion to evaluate and compare the results of different clustering algorithms.

⑤ We design a series of simulation experiments for the work described above. The results show that: 1) the spectral clustering algorithm gives better results for any shape of data distribution; 2) the spectral clustering algorithm performs better on gene expression data than hierarchical clustering and K-means; 3) the VP method finds the best number of clusters automatically; 4) the clustering results are more accurate after raising the weight of certain samples.

⑥ We find the relationship between the parameter θ and the parameter σ in each data set used in this thesis, and derive proper ranges of θ from that relationship.
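The two central ideas of the abstract, clustering from the eigenvectors of a similarity graph and scoring the result against known labels, can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes that σ is the width of a Gaussian similarity kernel and that ARI refers to the adjusted Rand index; the abstract does not spell out either. The sketch is limited to two clusters, uses only the Python standard library, and all function names are hypothetical. The VP method and the sample-weighting scheme are not reproduced, since the abstract does not describe how they work.

```python
import math
from collections import Counter


def gaussian_affinity(points, sigma):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with a zero diagonal.
    Assumption: sigma is the Gaussian kernel width."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def spectral_bipartition(points, sigma=1.0, iters=500):
    """Two-way split by the sign of the second eigenvector of
    D^-1/2 W D^-1/2, found by power iteration with deflation.
    Assumes every point has nonzero total similarity to the others."""
    n = len(points)
    w = gaussian_affinity(points, sigma)
    deg = [sum(row) for row in w]
    # The shifted matrix (I + D^-1/2 W D^-1/2) / 2 has eigenvalues in [0, 1],
    # so power iteration cannot be captured by a large negative eigenvalue.
    m = [[0.5 * ((i == j) + w[i][j] / math.sqrt(deg[i] * deg[j]))
          for j in range(n)] for i in range(n)]
    # The top eigenvector is known in closed form: D^1/2 * 1, normalized.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    v = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Deflate: remove the component along the top eigenvector.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return [0 if a < 0 else 1 for a in v]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy data: two tight groups on a line (not gene expression data).
    toy_points = [(0.0,), (1.0,), (5.0,), (6.0,)]
    toy_truth = [0, 0, 1, 1]
    toy_labels = spectral_bipartition(toy_points, sigma=1.0)
    print(toy_labels, adjusted_rand_index(toy_truth, toy_labels))
```

The adjusted Rand index ignores how the clusters are numbered, so a predicted labeling that swaps the two cluster ids still scores the same as one that matches the reference ids. For more than two clusters, the usual approach is to embed the points with several eigenvectors and run K-means on the embedding, which this sketch omits.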
Keywords/Search Tags: Gene expression data, Bioinformatics, Spectral clustering, Weights of the sample, Clustering accuracy