A Study On Some Issues About Gene Expression Data Analysis

Posted on: 2012-08-01  Degree: Master  Type: Thesis
Country: China  Candidate: H P Wang  Full Text: PDF
GTID: 2230330395962414  Subject: Computer application technology
Abstract/Summary:
Gene expression information offers important clues to the mechanisms underlying gene function and gene regulation, and is an important research topic in biology and medicine. The microarray is an effective technique for measuring gene expression: it can monitor the expression levels of several thousand genes simultaneously in a single experiment and quickly produce expression data for these genes. This paper studies several issues in gene expression data analysis, summarized as follows.

1. Unlike traditional methods for selecting feature genes, this paper proposes a new method, GSMDI (gene selection by multiple data integration), which selects feature genes by integrating multiple data sources. For each source dataset, we first calculate a differential expression statistic for every gene and replace the data with these statistics for further analysis. We then train and test on each single-source dataset using the features extracted from the multi-source data: the data used to train and test a classifier come from the same source, while data from the other sources are used only for feature selection. The approach is applied to four real microarray datasets and compared with current conventional methods; the experimental results show that it outperforms the other methods.

2. Multi-class classification is an active and difficult problem in gene expression data analysis. This paper proposes a multi-class classification method based on a category tree, whose structure can also provide more biological significance.
The method first constructs a complete graph from the relationships among the categories, applying a gene selection method during that process, and then constructs the category tree, which helps improve classification performance. Finally, genes are reselected on the category tree and SVM-based classifiers are trained, so that classification and gene selection are combined. The method is tested on two real datasets; the experimental results show that it is efficient and performs well in classification.

3. Cross-validation is probably the most popular approach for estimating the classification error rate on gene expression data. To reduce the variance of the estimate, the cross-validation procedure is repeated and the results are averaged. However, the number of repetitions is generally set to an empirical value. This paper proposes two methods (FCI and TSE) that determine the number of cross-validation repetitions from an approximate confidence interval. Experimental results on real data show that an empirically chosen repeat number is usually unreliable, and that the proposed methods can determine the number of repetitions needed to reach a pre-specified precision of the error rate. Furthermore, both methods adjust automatically to changes in the data, the number of folds k, and the classification model.

The results of this study on these issues in gene expression data analysis can help biological and medical researchers address and understand biological and medical problems.
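
As an illustration of item 3, the following is a minimal Python sketch of choosing the repeat number from an approximate confidence interval. It is not the FCI or TSE procedure, whose details the abstract does not give. It is a generic sequential rule that keeps repeating k-fold cross-validation until the normal-approximation interval for the mean error rate has a half-width no larger than a chosen target. The function names (`kfold_error`, `repeat_until_precise`), the nearest-centroid classifier, the synthetic data, and all parameter values are assumptions made for the example.

```python
import math
import random
import statistics


def kfold_error(X, y, k, rng):
    """One run of k-fold cross-validation with a nearest-centroid classifier."""
    idx = list(range(len(X)))
    rng.shuffle(idx)  # a fresh random partition on every run
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Class centroids are computed from the training part only.
        centroids = {}
        for label in set(y[i] for i in train):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
        for i in fold:
            pred = min(centroids, key=lambda c: math.dist(X[i], centroids[c]))
            errors += pred != y[i]
    return errors / len(X)


def repeat_until_precise(run_once, half_width, z=1.96,
                         min_repeats=5, max_repeats=500):
    """Repeat CV until the approximate CI half-width of the mean error rate
    is at most `half_width`, or until `max_repeats` runs have been made.

    Returns (mean error rate, achieved half-width, number of repeats).
    """
    if not 2 <= min_repeats <= max_repeats:
        raise ValueError("need 2 <= min_repeats <= max_repeats")
    rates = []
    while True:
        rates.append(run_once())
        n = len(rates)
        if n < min_repeats:
            continue
        # Normal-approximation margin: z * s / sqrt(n).
        margin = z * statistics.stdev(rates) / math.sqrt(n)
        if margin <= half_width or n >= max_repeats:
            return statistics.fmean(rates), margin, n


rng = random.Random(0)
# Synthetic stand-in for expression data: 40 samples, 20 "genes", two classes.
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0.8 * label, 1.0) for _ in range(20)] for label in y]

mean_err, margin, n = repeat_until_precise(
    lambda: kfold_error(X, y, 5, rng), half_width=0.01)
print(f"error rate {mean_err:.3f} +/- {margin:.3f} after {n} repeats")
```

Because the stopping rule depends on the observed spread of the repeated error rates, the repeat number it returns changes with the data, the value of k, and the classifier passed in, rather than being fixed in advance. The interval is only approximate, since it relies on a normal approximation and is checked after every run.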
Keywords/Search Tags: microarray (gene expression) data, integrated data, gene selection, multi-class classification, error rate, cross-validation