
Based On Microarray Data Annotation Information

Posted on: 2011-11-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Ma
Full Text: PDF
GTID: 1264330401455886
Subject: Biomedical engineering
Abstract/Summary:
Microarrays enable simultaneous measurement of the expression levels of tens of thousands of genes and have found widespread application in biological and biomedical research. The challenge of interpreting the vast amount of data they produce has led to the development of new methods in computational biology and bioinformatics. However, most algorithms use expression values only and ignore other attributes of genes, such as functional similarities, pathway information and protein-protein interactions. To take advantage of these attributes, we incorporated annotation resources into microarray analysis. This dissertation focuses on two of the most important issues: gene selection and gene clustering.

A basic yet challenging task is identifying changes in gene expression that are associated with particular biological conditions, which is called gene selection. Most gene selection algorithms suffer from the high dimensionality of expression data and the noise inherent in it. Five gene selection algorithms were compared: fold change (FC), the t-test, significance analysis of microarrays (SAM), Baldi’s empirical Bayes method (Baldi) and linear models for microarray data (Limma). The results revealed that Limma is the most powerful in most situations. However, Limma assumes that genes are expressed independently, so the correlation between genes, a very informative resource, is not considered. Keeping the form of the prior distribution used in Limma, we proposed an entirely new method for estimating the prior. This method, denoted Deam, incorporates functional similarities between genes, measured by Gene Ontology annotations. Three publicly available microarray data sets and simulated data were used to evaluate the proposed method. The results show that Deam outperforms Limma in detecting differentially expressed genes in most cases.
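To make the role of the Limma-style prior more concrete, the following minimal sketch compares an ordinary two-sample t-statistic with a moderated one in which the gene-wise variance is shrunk toward a prior variance. It is written in Python for illustration, not in the R used for the dissertation's implementation. The prior degrees of freedom `d0` and prior variance `s0_sq` are passed in as assumed values; how Limma or Deam actually estimates them is not shown here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(a, b):
    """Return the pooled sample variance of two groups and its degrees of freedom."""
    d = len(a) + len(b) - 2
    s2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / d
    return s2, d


def ordinary_t(a, b):
    """Classical two-sample t-statistic with equal-variance pooling."""
    s2, _ = pooled_variance(a, b)
    return (mean(a) - mean(b)) / sqrt(s2 * (1 / len(a) + 1 / len(b)))


def moderated_t(a, b, d0, s0_sq):
    """Moderated t-statistic: the gene's variance is shrunk toward s0_sq.

    d0 and s0_sq are ASSUMED prior hyperparameters supplied by the caller;
    estimating them from data is the part that differs between methods.
    """
    s2, d = pooled_variance(a, b)
    s2_post = (d0 * s0_sq + d * s2) / (d0 + d)  # posterior (shrunken) variance
    return (mean(a) - mean(b)) / sqrt(s2_post * (1 / len(a) + 1 / len(b)))


if __name__ == "__main__":
    # A gene with a tiny observed variance: the ordinary t is inflated,
    # while the moderated t is pulled back by the prior variance.
    control = [1.00, 1.01, 0.99]
    treated = [1.10, 1.11, 1.09]
    print("ordinary t :", ordinary_t(treated, control))
    print("moderated t:", moderated_t(treated, control, d0=4, s0_sq=0.05))
```

With `d0 = 0` the moderated statistic reduces to the ordinary t-statistic, which is a quick way to check the shrinkage formula.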
In addition, Deam has further potential as more algorithms for measuring functional similarity between genes are proposed and existing ones are improved. For the convenience of biological researchers without a programming or statistical background, an implementation of Deam with a web front-end based on the RApache model was provided.

Gene clustering is another important approach in microarray analysis. Compared with gene selection, clustering is a more complicated, open problem. Difficulties remain in the selection of distance metrics, the selection of clustering algorithms and the evaluation of clustering results. An external criterion, denoted PS in this dissertation, was proposed that uses Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotations to measure the performance of clustering algorithms. After the feasibility of this criterion was demonstrated, it was used to compare six commonly used clustering algorithms: four types of hierarchical clustering, k-medoids clustering and the self-organizing map. Hierarchical clustering using Ward’s method and k-medoids outperformed the others.

In the clustering literature, principal component analysis is sometimes applied to reduce the dimensionality of gene expression data prior to clustering. Using the proposed external criterion, we studied the effectiveness of principal components in capturing cluster structure. The quality of clusters obtained from the original data was compared with that of clusters obtained after projecting the data onto subsets of the principal component axes. The results showed that clustering with principal components instead of the original variables did not necessarily improve cluster quality. Overall, we do not recommend using principal components as input for clustering in most situations.

The major innovations in this dissertation are summarized below.

1. A new gene selection method, Deam, was proposed, which takes advantage of functional similarities between genes. Experimental results showed that Deam performed better than current methods in detecting differentially expressed genes in most cases.
2. Instead of using two univariate normal distributions with different means, as suggested in prior research, a new method using multivariate normal distributions to generate simulated expression data was proposed. The covariance matrices of these distributions reflect the correlations between their components.
3. Given the functional similarity matrix for the whole human genome, an efficient method was proposed to construct platform-specific similarity matrices. It reduces the dimension of the matrices from n×n to n×d0’, where n >> d0’.
4. An implementation of Deam in the R language and a web front-end were provided.
5. An external criterion using pathway annotations to measure the performance of clustering algorithms was provided. Performance is reflected in the aggregation of genes belonging to the same pathway. The feasibility of this criterion was demonstrated.
6. The effectiveness of principal components in capturing cluster structure was studied. Using principal components as input for clustering was not recommended.
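The idea in item 2, generating simulated expression data from multivariate normal distributions whose covariance matrix encodes correlation between genes, can be sketched as follows. This is an illustrative Python example using only the standard library, not the dissertation's R code. The mean vectors, the covariance values and the function names are all assumptions chosen for the demonstration; the abstract does not say how the covariance matrices were actually constructed.

```python
import random
from math import sqrt


def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def simulate_samples(mean_vec, cov, n_samples, rng):
    """Draw n_samples expression vectors from N(mean_vec, cov).

    Each vector is mean_vec + L z, where z holds independent standard normals.
    """
    L = cholesky(cov)
    n = len(mean_vec)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        samples.append(
            [mean_vec[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        )
    return samples


def sample_correlation(samples, i, j):
    """Pearson correlation between components i and j across samples."""
    xs = [s[i] for s in samples]
    ys = [s[j] for s in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(0)
    # ASSUMED toy setup: genes 0 and 1 are correlated, gene 2 is independent.
    cov = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]
    control = simulate_samples([0.0, 0.0, 0.0], cov, 1000, rng)
    # Genes 0 and 1 are differentially expressed in the treated group.
    treated = simulate_samples([1.0, 1.0, 0.0], cov, 1000, rng)
    print("corr(gene0, gene1) in control:", sample_correlation(control, 0, 1))
```

Setting all off-diagonal covariance entries to zero recovers the independent univariate setup that the abstract contrasts with.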
Keywords/Search Tags: Microarray, Gene selection, Empirical Bayes, Gene ontology, Clustering