The Design And Implementation Of Gene Signature Collection And Gene Enrichment Analysis

Posted on:2016-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Yan

Full Text:PDF

GTID:2308330461977180

Subject:Software engineering

Abstract/Summary:

Discovering tumor subtypes and guiding personalized treatment from gene expression data are now well-established topics in bioinformatics. However, the occurrence and development of a tumor usually involve the co-expression of a group of related genes, so analysis based on gene sets can reveal more information than analysis of single genes. A gene signature is a set of genes associated with a particular biological phenotype in a cell; curators usually extract signatures from experimental publications and store them in databases for further analysis. Current research on gene signatures has two main parts: collecting gene signatures and analyzing them.

In this work, we propose a framework for gene signature collection and analysis, and implement a platform that demonstrates the whole framework. In the collection part, we aim to improve curation efficiency by screening articles automatically. First, we use Principal Component Analysis (PCA) to generate optimized keywords. Then we use web crawling techniques to download articles from the Internet and convert them to text files. Finally, a Support Vector Machine (SVM) divides the articles into two classes: those with gene signatures and those without.

In the clustering analysis part, we first map gene expression data to gene sets using signature databases, which reduces the dimension of the data and allows analysis across platforms at the same time. Then we compare most of the available Non-negative Matrix Factorization (NMF) algorithms and apply the fastest and most efficient one, NMF based on Greedy Coordinate Descent (NMF-GCD), to reduce the dimension further. Finally, we cluster the decomposition results by controlling sparseness and membership.
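
The article-classification step of the collection part can be illustrated with a minimal sketch. The abstract does not say which text features the SVM uses, so the TF-IDF representation, the toy corpus, and every name below (`train_texts`, `build_classifier`, and so on) are assumptions for illustration only; the PCA keyword generation and the crawling step are not covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = article reports a gene signature, 0 = it does not.
train_texts = [
    "we identified a 70 gene signature predicting breast cancer prognosis",
    "a gene expression signature of upregulated genes distinguishes tumor subtypes",
    "the signature comprises genes differentially expressed in metastatic tumors",
    "we review clinical guidelines for patient care and hospital management",
    "this editorial discusses funding policy for cancer research",
    "a survey of patient satisfaction with hospital care services",
]
train_labels = [1, 1, 1, 0, 0, 0]


def build_classifier():
    """TF-IDF features fed to a linear-kernel SVM (the feature choice is an assumption)."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SVC(kernel="linear"),
    )


classifier = build_classifier().fit(train_texts, train_labels)

# Unseen article texts to triage before manual curation.
new_texts = [
    "a novel gene signature of differentially expressed genes in tumors",
    "hospital management policy and patient care guidelines",
]
predictions = classifier.predict(new_texts)
for label, text in zip(predictions, new_texts):
    print(label, text)
```

In a setting like the one described, only articles predicted as class 1 would be passed on to curators, which is where a gain in curation efficiency would come from.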
In this way, we can enrich genes, cluster phenotypes, and find modules in which groups of gene sets are coordinately associated with groups of phenotypes across multiple studies at the same time. With our collection method, the efficiency of the curation process rose from 37% to 94%. We also apply the gene enrichment analysis algorithm to simulated data and compare the results with iBBiG, hierarchical clustering, and FABIA. The results show that our algorithm outperforms these commonly used clustering methods, discovers overlapping clusters of diverse sizes, and is robust in the presence of noise. This confirms that the algorithm can enrich genes, cluster phenotypes, and find modules with higher accuracy and efficiency. Finally, we apply the clustering algorithm to three breast cancer databases, which validates the soundness of the overall clustering process.
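
The decomposition-and-membership idea can be sketched on simulated data as follows. This is not the NMF-GCD algorithm used in the thesis: for brevity the sketch swaps in the classic Lee-Seung multiplicative updates. The simulated matrix, the membership threshold, and all function names are assumptions made for illustration.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V ~ W @ H with Lee-Seung multiplicative updates (a stand-in for NMF-GCD)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Update H, then W; both stay non-negative because every factor is non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors


def memberships(F, threshold=0.5):
    """Mark entries at least `threshold` times their column maximum as module members."""
    col_max = F.max(axis=0, keepdims=True)
    return F >= threshold * np.maximum(col_max, 1e-12)


# Simulated gene-set-by-phenotype matrix with two overlapping planted modules plus noise.
rng = np.random.default_rng(42)
V = 0.05 * rng.random((30, 20))
V[0:12, 0:8] += 1.0    # module A
V[8:20, 6:14] += 1.0   # module B, overlapping A in rows 8-11 and columns 6-7

W, H, errors = nmf_multiplicative(V, rank=2)
gene_set_members = memberships(W)      # shape (30, 2): gene sets per module
phenotype_members = memberships(H.T)   # shape (20, 2): phenotypes per module

print("reconstruction error:", errors[0], "->", errors[-1])
print("gene sets per module:", gene_set_members.sum(axis=0))
print("phenotypes per module:", phenotype_members.sum(axis=0))
```

Because membership is decided per factor by a threshold rather than by a single hard assignment, a gene set or phenotype can belong to more than one module, which is what allows overlapping clusters to be reported.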

Keywords/Search Tags:

Gene Enrichment Analysis, Gene Signature, Non-negative Matrix Factorization, Support Vector Machine, Greedy Coordinate Descent

Related items

1	The Application Research Of Support Vector Machine In Non-spherical Distribution Data Set And Tumor Gene
2	Gene Selection And Cancer Classification Based On Optimization Algorithm And Support Vector Machine
3	Research On Algorithms For Gene Recognition And Microarray Data Recognition
4	Research Of Support Vector Machine For The Analysis Of Gene Expression Data
5	Data Analysis Of Cancer Gene Expression Based On SVM-RFE Algorithm
6	Tumor Gene Identification Study On Support Vector Machine Classification Model
7	Tumor Gene Chip Data Clustering Analysis Algorithm
8	Study On Gene Identification Using Signal Processing Methods
9	Support Vector Machine And Its Application In Gene Expression Data
10	Research On Extraction Of Feature Gene Subset Based On A Hybrid Between Genetic Arithmetic And Support Vector Machines