
The Design And Implementation Of Gene Signature Collection And Gene Enrichment Analysis

Posted on: 2016-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Yan
GTID: 2308330461977180
Subject: Software engineering
Abstract/Summary:
Discovering tumor subtypes and guiding personalized treatment from gene expression data are now well-established tasks in bioinformatics. However, the occurrence and development of a tumor usually involve the co-expression of a group of related genes, so analysis based on gene sets can reveal more information than analysis of single genes. A gene signature is a set of genes associated with a particular biological phenotype in a cell; signatures are usually extracted from experimental publications by curators and stored in databases for further analysis. Current research on gene signatures falls into two parts: collecting gene signatures and analyzing them.

In this thesis, we propose a framework for gene signature collection and analysis and implement a platform that presents the whole framework. In the collection part, we aim to improve curation efficiency by screening articles. First, we use Principal Component Analysis (PCA) to generate optimized keywords. Then we use web crawling techniques to download articles from the Internet and convert them to text files. Finally, a Support Vector Machine (SVM) divides the articles into two classes: those with gene signatures and those without. In the clustering analysis part, we first map gene expression data to gene sets using signature databases, which reduces the dimension of the data and allows analysis across platforms at the same time. We then compare most of the available Non-negative Matrix Factorization (NMF) algorithms and apply the fastest and most efficient one, NMF based on Greedy Coordinate Descent (NMF-GCD), to reduce the dimension. Finally, we cluster the decomposition results by controlling sparseness and membership. In this way, we can enrich genes, cluster phenotypes, and find modules in which groups of gene sets are coordinately associated with groups of phenotypes across multiple studies at the same time.

With our collection method, the efficiency of the curation process rose from 37% to 94%. We also apply the gene enrichment analysis algorithm to simulated data and compare the results with iBBiG, hierarchical clustering, and FABIA. The results show that our algorithm outperforms these commonly used clustering methods, discovers overlapping clusters of diverse sizes, and is robust in the presence of noise. This confirms that the algorithm can enrich genes, cluster phenotypes, and find modules with higher accuracy and efficiency. Finally, we apply the clustering algorithm to three breast cancer databases, which validates the soundness of the overall clustering process.
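
The factorization step mentioned in the abstract can be illustrated with a small sketch. The thesis uses NMF-GCD; the code below is not that algorithm. It substitutes the simpler Lee-Seung multiplicative update rule as a stand-in, and it assigns each column to the factor with the largest coefficient, which is only a rough proxy for the sparseness and membership control described above. The matrix layout (rows as gene sets, columns as phenotypes), the function names, and the toy data are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, used to monitor reconstruction quality."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates (a stand-in, NOT the greedy coordinate
    descent variant named in the thesis).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def assign_clusters(h):
    """Assign each column of H to the factor with the largest coefficient.

    This hard assignment is an illustrative simplification; it does not
    produce the overlapping clusters the thesis reports.
    """
    rank, m = len(h), len(h[0])
    return [max(range(rank), key=lambda k: h[k][j]) for j in range(m)]


if __name__ == "__main__":
    # Hypothetical toy matrix: rows are gene sets, columns are phenotypes.
    toy = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    w, h = nmf(toy, rank=2)
    print("reconstruction error:", frobenius_error(toy, w, h))
    print("column clusters:", assign_clusters(h))
```

Multiplicative updates were chosen here only because they fit in a few lines without external libraries; a coordinate descent solver would normally converge faster, which is consistent with the thesis's preference for NMF-GCD.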
Keywords/Search Tags: Gene Enrichment Analysis, Gene Signature, Non-negative Matrix Factorization, Support Vector Machine, Greedy Coordinate Descent