Research On Gene Selection With Gene Expression Data

Posted on: 2017-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: H H Chen
GTID: 2284330488953574
Subject: Operational Research and Cybernetics
Abstract/Summary:
Cancer treatment has become a worldwide focus and remains a great challenge for the medical community because cancer is complex and variable. Cancer is a genetic disease caused by differential gene expression in cells. DNA chip technology has proven to be a major breakthrough in molecular biology, monitoring the expression of thousands of genes in a single experiment, and its development provides a new avenue for cancer treatment. Using gene expression data to find genes relevant to cancer and to distinguish cancer classes or subclasses with similar morphological appearance is of great significance for better detection, treatment and prognosis.

Since gene expression data consist of a large number of genes measured on relatively few samples, they can give rise to the "curse of dimensionality", and classical data analysis techniques have difficulty analyzing them efficiently. Cancers are usually marked by a change in the expression levels of certain genes. How to reduce the dimension of the data, remove redundant genes and select the informative genes associated with cancer, so as to improve the accuracy of cancer type identification, is therefore crucial in gene expression data analysis.

For this reason, this thesis focuses on algorithms for selecting informative genes, aiming to select from a huge amount of data a small number of informative genes that are strongly correlated with cancer. The main work of this thesis is as follows:

1. Based on the idea of the SCAD algorithm, this thesis proposes a new gene selection method named the KBCGS algorithm. Combining supervised and unsupervised learning, it weights genes according to their discriminant ability: it minimizes the clustering objective function while obtaining the optimal gene weights, and thereby selects informative genes. Building on the KFCM algorithm, it introduces a kernel function and a global adaptive distance to account for nonlinear relationships in the data, which effectively removes redundant genes and improves the effectiveness of the algorithm. The algorithm is efficient, simple and easy to extend.

2. Experiments are carried out on eight classical data sets using KNN and SVM classifiers. KBCGS is compared with five popular gene selection methods, and the results show that it achieves better or similar performance. In particular, on the Lung and NCI60 data sets, which are hard to classify, the classification accuracies of the proposed approach are 87% and 80.52%, significantly higher than those of the other methods. The experiments verify the effectiveness of the proposed method.

3. On the Prostate, AMLALL and Lymphoma data sets, the method is compared with previous work with respect to the biological significance of the selected informative genes, whose annotations are queried through the NCBI website. The comparison suggests that the method selects genes with strong biological meaning and that the selected genes can serve as "biomarkers" for clinical cancer detection, indicating that the method is of practical significance.
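
The general workflow described above (weight genes by discriminant ability, keep a small subset, then check it with a nearest-neighbour classifier) can be illustrated with a minimal sketch. This is not the KBCGS algorithm: the abstract does not give its objective function, kernel or distance, so the sketch swaps in a simple Fisher-style score (between-class over within-class variance) and a leave-one-out 1-nearest-neighbour check. The function names `gene_scores`, `select_genes` and `loo_1nn_accuracy` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def gene_scores(samples, labels):
    """Score each gene by between-class variance over within-class variance.

    Assumption: a Fisher-style stand-in score, not the KBCGS gene weights.
    """
    n_genes = len(samples[0])
    by_class = defaultdict(list)
    for row, label in zip(samples, labels):
        by_class[label].append(row)
    scores = []
    for g in range(n_genes):
        overall = sum(row[g] for row in samples) / len(samples)
        between = within = 0.0
        for rows in by_class.values():
            mean = sum(r[g] for r in rows) / len(rows)
            between += len(rows) * (mean - overall) ** 2
            within += sum((r[g] - mean) ** 2 for r in rows)
        # The small constant avoids division by zero for constant genes.
        scores.append(between / (within + 1e-12))
    return scores


def select_genes(samples, labels, k):
    """Return the indices of the k highest-scoring genes."""
    scores = gene_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


def loo_1nn_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen genes."""
    reduced = [[row[g] for g in genes] for row in samples]
    correct = 0
    for i, x in enumerate(reduced):
        # Nearest other sample by Euclidean distance in the reduced gene space.
        nearest = min(
            (j for j in range(len(reduced)) if j != i),
            key=lambda j: dist(x, reduced[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(reduced)
```

Here `samples` is assumed to be a list of per-sample expression vectors and `labels` the matching class labels. Note that scoring genes on the full data set before leave-one-out evaluation gives optimistic accuracy; a careful evaluation would repeat the selection inside each fold.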
Keywords/Search Tags: gene expression data, cancer, gene selection, multi-classification, clustering