
Application of the Runs Test and the K-S Test in Gene Selection

Posted on: 2016-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Hu
Full Text: PDF
GTID: 2270330473461430
Subject: Computer application technology
Abstract/Summary:
Gene chip technology can measure a large number of DNA sequences at once, which has produced a large volume of gene expression data. Identifying the discriminative genes among tens of thousands of candidates, in order to tell cancer or tumor patients from healthy people, has become one of the most active research areas in medicine, bioinformatics and artificial intelligence in the 21st century. Gene expression data are typical high-dimensional, small-sample data and usually contain many noisy or redundant genes. Because discriminative genes make up only a very small proportion of a gene expression dataset, gene feature selection is a challenging problem in analyzing such data. If a gene can significantly distinguish normal samples from tumor samples, its expression level must differ significantly between the two classes. Many researchers have therefore proposed Filter algorithms based on parametric and nonparametric statistics and have achieved considerable results. To overcome the disadvantages of the existing statistical test algorithms in gene selection, this thesis proposes a Runs test algorithm for selecting discriminative genes. However, nonparametric statistical tests ignore the redundancy between genes, so this thesis also proposes a new gene selection algorithm based on the K-S test and the mRMR (Minimum Redundancy-Maximum Relevance) principle. The main work and innovations of this thesis are as follows.

(1) To overcome the disadvantages of the Wilcoxon test and the T test in gene selection, the Runs test is adopted to select the genes that distinguish normal people from cancer patients.
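As a rough illustration of Runs-test gene selection, the sketch below applies a two-sample Wald-Wolfowitz runs test to each gene and keeps the genes whose p-value falls below a significance level. The abstract does not say which variant of the Runs test the thesis uses, so the one-sided normal approximation, the handling of ties, the default `alpha`, and the names `runs_test_pvalue` and `select_genes_by_runs_test` are all assumptions made here for illustration.

```python
import math


def runs_test_pvalue(normal, tumor):
    """One-sided two-sample runs test using the normal approximation.

    Assumption: the thesis may use a different variant or exact tables.
    """
    n1, n2 = len(normal), len(tumor)
    n = n1 + n2
    # Pool the expression values, remembering the class of each sample.
    # Ties are broken by class label, which is a simplification.
    pooled = sorted([(v, 0) for v in normal] + [(v, 1) for v in tumor])
    labels = [c for _, c in pooled]
    # A run is a maximal block of consecutive samples from the same class.
    runs = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    # Mean and variance of the run count when both classes share a distribution.
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    if var <= 0:
        return 1.0  # too few samples to say anything
    z = (runs - mean) / math.sqrt(var)
    # Few runs mean the classes separate well, so use the lower tail: Phi(z).
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def select_genes_by_runs_test(expr, alpha=0.05):
    """expr maps a gene name to (normal_values, tumor_values)."""
    return [gene for gene, (normal, tumor) in expr.items()
            if runs_test_pvalue(normal, tumor) < alpha]
```

A gene whose normal and tumor values barely interleave produces very few runs and therefore a small p-value, while a gene whose values alternate between the classes produces many runs and is dropped.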
The experimental results on three popular, classic gene expression datasets demonstrate the power of the Runs test for gene selection: at a given significance level, it finds smaller gene subsets with higher classification accuracy than the Wilcoxon test and the T test.

(2) To overcome the disadvantage that nonparametric statistical tests ignore the redundancy between genes, this thesis proposes a gene feature selection algorithm based on the K-S test and the mRMR principle. The algorithm has two steps. The first step selects the discriminative features with the K-S test, and the second step applies the minimum redundancy-maximum relevance principle to select a gene subset from the features kept by the first step. SVM is adopted as the classifier, and F1-measure, accuracy and AUC are used to evaluate the classifiers built on the selected gene subsets. The proposed algorithm is compared with the K-S, mRMR, RELIEF and FAST algorithms. The results averaged over 10 runs of these gene selection algorithms on five classic gene expression datasets show that the new K-S and mRMR based algorithm is significantly faster than mRMR and performs better than K-S, mRMR, RELIEF and FAST.
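The two-step procedure in item (2) can be sketched roughly as follows: a K-S filter first discards genes whose class distributions do not differ, and a greedy mRMR pass then picks a subset from the survivors. This is a minimal sketch under several assumptions not stated in the abstract: the asymptotic K-S p-value approximation, the three-state discretization at mean ± std/2, the "relevance minus mean redundancy" form of mRMR based on mutual information, and all function names and the default `alpha` are chosen here for illustration only.

```python
import math
import statistics
from bisect import bisect_right
from collections import Counter


def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def ks_pvalue(a, b):
    """Asymptotic two-sample K-S p-value (an approximation for small samples)."""
    d = ks_statistic(a, b)
    ne = len(a) * len(b) / (len(a) + len(b))
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        return 1.0  # the tail probability is essentially 1 in this range
    q = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return min(1.0, max(0.0, q))


def discretize(values):
    """Map values to -1 / 0 / 1 around mean +/- std/2 (an assumed scheme)."""
    m, s = statistics.mean(values), statistics.pstdev(values)
    return [-1 if v < m - s / 2 else 1 if v > m + s / 2 else 0 for v in values]


def mutual_info(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log(c * n / (px[u] * py[v]))
               for (u, v), c in pxy.items())


def mrmr(features, labels, k):
    """Greedy mRMR: maximize relevance minus mean redundancy."""
    if not features:
        return []
    relevance = {g: mutual_info(x, labels) for g, x in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):
        def score(g):
            redundancy = sum(mutual_info(features[g], features[s])
                             for s in selected) / len(selected)
            return relevance[g] - redundancy
        rest = [g for g in features if g not in selected]
        selected.append(max(rest, key=score))
    return selected


def ks_mrmr_select(expr, labels, k, alpha=0.05):
    """expr maps gene -> values aligned with labels (0 = normal, 1 = tumor)."""
    kept = {}
    for gene, values in expr.items():
        normal = [v for v, y in zip(values, labels) if y == 0]
        tumor = [v for v, y in zip(values, labels) if y == 1]
        if ks_pvalue(normal, tumor) < alpha:  # step 1: K-S filter
            kept[gene] = discretize(values)
    return mrmr(kept, labels, k)              # step 2: mRMR on the survivors
```

Because mRMR only sees the genes that pass the K-S filter, it computes far fewer pairwise redundancy terms than it would on the full gene set, which is consistent with the speed advantage over plain mRMR reported above. The SVM classification and the F1-measure, accuracy and AUC evaluation are not covered by this sketch.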
Keywords/Search Tags:Gene Selection, Runs test, K-S test, mRMR, Support Vector Machine