
Several Key Techniques Of Analyzing Gene Expression Data

Posted on: 2008-08-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Li
GTID: 1100360245497389
Subject: Computer application technology
Abstract/Summary:
Analyzing gene expression data draws on knowledge from many fields, including statistics, artificial intelligence, computer science and biology. Applying gene expression data in these areas is itself a challenge. Classification of complex diseases, identification of differentially expressed genes and analysis of relations between genes are three important issues in analyzing gene expression data. Although many methods have been proposed for these problems, several aspects of the techniques still deserve deeper study. This thesis proposes new methods and schemes for all three techniques.

A new framework is proposed for detecting differentially expressed genes. It consists of four steps: test, evaluation, ranking and selection. In the test step, multiple statistical tests are combined to detect differentially expressed genes, which overcomes the limitations of relying on a single statistical test. In the evaluation step, the degree to which each gene is differentially expressed is evaluated using a residual value. In the ranking step, genes are ranked according to their residual values, which overcomes the disadvantage of ranking by p-value. In the selection step, a small group of important genes is selected from thousands of genes according to a statistical threshold, which overcomes a disadvantage of other methods and helps biologists select crucial genes for further study. Every statistical test has strengths and weaknesses in any particular case, so any one test may fail to detect important differentially expressed genes. To address this, the framework re-evaluates all genes that were not selected as important genes. Experimental results on four public cancer gene expression data sets demonstrate that the proposed framework can effectively find differentially expressed genes. Evaluating methods for finding differentially expressed genes is itself a challenging problem.
We propose a novel simulation method based on real gene expression data to evaluate the proposed method objectively and compare it with other statistical tests. Molecular experiments and simulation results show that the precision of the proposed method is significantly higher than that of three statistical tests: the Kolmogorov-Smirnov (KS) test, the t-test and the Wilcoxon rank-sum test (WRST).

For the classification of complex diseases, two new classifiers are proposed according to the characteristics of gene expression data: a visualization classifier based on differences in the distribution of feature genes, and a gene pair classifier with a simple decision rule. In the visualization classification method, the signal-to-noise ratio (SNR) is first used to select feature genes from the gene expression data; then the mean expression values of the feature genes are computed; finally, the distribution of the feature genes in the two groups of samples is plotted and used to classify cancer samples. For a randomly selected sample, the distribution of its feature genes is observed: the sample is assigned to the normal group if it shows the characteristics of a normal sample, and to the disease group otherwise. Applications of the proposed method to several public gene expression data sets demonstrate that it can effectively classify complex diseases. One advantage of the proposed method is that the classification process is transparent. Compared with other classification methods, it can help biologists find more information, such as changes in gene expression levels and differences between samples.

Biologists expect the decision rule derived from a classifier to be easy to understand in biological terms. We propose a new method based on informative gene pairs to address this problem. The rule derived from the proposed classifier is simple and readily interpretable by biologists.
For a pair of genes, a classifier based on a linear regression model is first constructed; then the performance of the classifier is evaluated using classification accuracy; finally, all classifiers based on gene pairs are compared, and the classifier with the highest accuracy is selected. Multiple models may achieve the same accuracy, so we use a secondary criterion, the residual value signal-to-noise ratio (RVSNR), to rank models with the same accuracy: when two models have the same accuracy, the model with the higher RVSNR is chosen as the optimal model. A single model may fail to classify some samples when the structure of the gene expression data is complex. To obtain higher classification accuracy, we combine a genetic algorithm (GA) with multiple models to classify complex diseases. Experimental results on several gene expression data sets demonstrate that 100% leave-one-out cross-validation (LOOCV) accuracy can be achieved using either a single model based on the optimal feature gene pair or a combination of multiple top-ranked classification models. The proposed method performs well in finding a large number of excellent marker gene pairs, and it successfully identifies important cancer-related genes that had been validated in previous biological studies but were not discovered by other methods. Biologists can predict the function of unknown genes through the rule derived from the classification model and find new patterns, structures and information.

We developed an approach to analyze relations between genes in different samples by combining clustering, GO term analysis, statistical analysis and gene networks. We identify functional modules and biological processes that change significantly between samples. Application to colon cancer data demonstrates that the proposed method can identify the functional modules and biological processes that are closely related to cancer development.
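
As a rough illustration of the gene pair classifier described in this abstract, the sketch below scores every gene pair by leave-one-out cross-validation accuracy and keeps the most accurate pair. The abstract does not give the form of the linear regression model, so regressing a 0/1 class label on the two genes by least squares and thresholding the prediction at 0.5 is an assumption made here, as are the function names `loocv_accuracy` and `best_gene_pair`. The RVSNR tie-break and the genetic-algorithm combination of models are not shown.

```python
from itertools import combinations

import numpy as np


def loocv_accuracy(x, y):
    """Leave-one-out accuracy of a least-squares rule for one gene pair.

    x: (n_samples, 2) expression values of the two genes.
    y: (n_samples,) class labels coded 0 (normal) or 1 (disease).
    Assumption: the label is regressed on the two genes and the
    prediction is thresholded at 0.5; the thesis may use another form.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), x])  # intercept + two genes
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i  # hold out sample i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        predicted = 1 if design[i] @ coef > 0.5 else 0
        correct += int(predicted == y[i])
    return correct / n


def best_gene_pair(expr, y):
    """Return (accuracy, (gene_i, gene_j)) for the most accurate pair.

    expr: (n_samples, n_genes) expression matrix.
    Ties keep the first pair found; the thesis ranks ties by RVSNR,
    which is not implemented in this sketch.
    """
    best = (-1.0, None)
    for i, j in combinations(range(expr.shape[1]), 2):
        acc = loocv_accuracy(expr[:, [i, j]], y)
        if acc > best[0]:
            best = (acc, (i, j))
    return best
```

Here `expr` is a samples-by-genes array and `y` holds the 0/1 labels. The exhaustive search over pairs grows quadratically with the number of genes, so in practice it would likely be run on a pre-filtered gene set.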
Keywords/Search Tags:Gene expression data, Differentially expressed genes, Classification, Clustering, Gene network