Research On Several Key Technologies Of Gene Expression Data Mining

Posted on: 2011-07-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R C Cai
Full Text: PDF
GTID: 1118360308963887
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is an important research branch of computer science. As a central topic in bioinformatics, gene expression data mining provides insight into the physiological state of cells and has many real-world applications, including disease-causing gene discovery, disease diagnosis, and therapeutic response prediction. Small sample size, high dimensionality, and low signal-to-noise ratio are the three main challenges of gene expression data mining. We address these problems from three directions: selecting the most informative genes, discovering the most interesting gene expression association rules, and constructing robust classification models. The contributions are as follows:

1. A conditional mutual information based gene selection algorithm (MIGS) is proposed. In MIGS, genes are chosen by sequential forward selection according to how much each one reduces the uncertainty of the class label given the genes already selected. This approach reduces redundancy in the selected gene set, which has been a difficult task for traditional feature selection algorithms. An approximate evaluation of the conditional mutual information is devised to balance the trade-off between generalization ability and the accuracy of the conditional mutual information estimate. A pruning strategy is also devised to improve the efficiency of MIGS. Experimental results on real-world gene expression data sets show the effectiveness and efficiency of MIGS.

2. To improve classification accuracy, we propose a Random Subspace Method based gene selection method, ERSM. In ERSM, subsets of genes are first randomly generated; a Least Squares Support Vector Machine is then trained on each subset to produce the relative importance of each gene; finally, the importance values obtained for each gene across these random subsets are combined into its final importance. This divide-and-conquer framework overcomes disadvantages of Support Vector Machine Recursive Feature Elimination (SVM-RFE) such as high computational complexity and low generalization ability. The combination of ERSM and a Support Vector Machine (SVM) achieves high classification accuracy on real-world gene expression data sets and is a promising method for gene expression based disease diagnosis.

3. For interpretability of the classification results, association rules are employed to represent gene expression patterns and to classify gene expression samples. We propose two lattice based interestingness measures for ranking the rules within an equivalent rule group. Based on these interestingness measures, an incremental Apriori-like algorithm is designed to select the top-k interesting lower bound rules from the rule group. Moreover, we present an improved classification model, IRCBT, to fully exploit the potential of the selected rules. Our empirical studies on five gene expression data sets show that the proposed methods improve both the effectiveness and the efficiency of rule extraction and classifier construction over gene expression data sets.

4. Because much information is lost in the discretization procedure, kernel density estimation is used to evaluate the interestingness of association rules. This approach makes the most of the original gene expression data and can find the most interesting association rules. To discover rules efficiently under the new interestingness measure, an association rule mining procedure is devised to discover approximate top-k association rules, which is a trade-off between computational time and accuracy. Finally, our model deals with the over-fitting problem of the classification model by eliminating redundant rules using a conditional independence test.

Overall, the proposed framework achieves classification accuracy as high as that of 'black box' classification models while producing highly interpretable results.
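
The forward selection idea behind MIGS (contribution 1) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes expression values have already been discretized, it uses a plain plug-in estimate of the conditional mutual information I(gene; label | selected genes) instead of the approximate evaluation described above, and it omits the pruning strategy. The names `entropy`, `cond_mutual_info`, `forward_select`, and `min_gain` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def cond_mutual_info(x, y, z):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for discrete sequences."""
    xz = list(zip(x, z))
    yz = list(zip(y, z))
    xyz = list(zip(x, y, z))
    return entropy(xz) + entropy(yz) - entropy(xyz) - entropy(z)


def forward_select(columns, labels, k, min_gain=1e-9):
    """Greedy forward selection of up to k genes.

    columns: dict mapping gene name -> list of discretized values per sample.
    labels:  list of class labels per sample.
    At each step the gene with the largest conditional mutual information
    with the label, given the genes already selected, is added. Selection
    stops early when no remaining gene adds more than min_gain bits.
    """
    selected = []
    # Joint state of the selected genes for each sample (empty tuple at start).
    joint = [()] * len(labels)
    remaining = set(columns)
    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(remaining),
            key=lambda g: cond_mutual_info(columns[g], labels, joint),
        )
        gain = cond_mutual_info(columns[best], labels, joint)
        if gain <= min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        joint = [j + (v,) for j, v in zip(joint, columns[best])]
    return selected


if __name__ == "__main__":
    # Toy data (invented for illustration): g3 is a noisy copy of g1.
    toy_genes = {
        "g1": [0, 0, 0, 0, 1, 1, 1, 1],
        "g2": [0, 0, 1, 1, 0, 0, 1, 1],
        "g3": [0, 0, 0, 1, 1, 1, 1, 1],
    }
    toy_labels = [0, 0, 0, 0, 1, 1, 2, 2]
    print(forward_select(toy_genes, toy_labels, k=2))
```

In the toy data, g3 has a higher marginal mutual information with the label than g2, yet once g1 is selected g3 adds nothing, so the conditional criterion picks g2 second. This is the redundancy reduction the abstract refers to. With small sample sizes the plug-in estimate becomes unreliable as the selected set grows, which is presumably why the thesis introduces an approximate evaluation.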
Keywords/Search Tags:Bioinformatics, Gene Expression Data, Feature Selection, Conditional Mutual Information, Random Subspace Method, Association Rule, Lattice, Kernel Density Estimation