Research On 2D Spatial Gene Selection Algorithm Based On Unbalanced Gene Data

Posted on: 2018-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Z Wang
GTID: 2350330542478421
Subject: Computer software and theory
Abstract/Summary:
In recent years, the rapid development of computer technology and its wide application in the biomedical field, especially the appearance of DNA chip technology, have offered a new way to diagnose and classify cancer and to study its formation mechanism and treatment. This has also produced large amounts of gene expression data with high-dimensional features. Feature selection, as a dimension reduction method for high-dimensional data, can effectively eliminate redundant and irrelevant genes and retain a small set of features that are highly correlated with the classification task. It can often mitigate the curse of dimensionality, reduce computational complexity, and help improve the recognition accuracy of cancer classification models. Feature selection is therefore crucial to the study of cancer gene expression data and has practical significance.

Microarray gene expression datasets are characterized by high dimensionality and small sample sizes, and they contain a high level of gene variables that are redundant or irrelevant for disease diagnosis; only a few vital genes cause the differences between samples. These characteristics pose unprecedented challenges to traditional feature selection algorithms. In addition, class imbalance is an important characteristic of cancer gene datasets. Methods for processing imbalanced data include re-sampling, cost-sensitive learning, and feature selection. In practice, current approaches that address only the high-dimensional, small-sample problem or only the imbalance problem run into performance bottlenecks on cancer gene expression datasets. How to choose an efficient and reliable gene subset from high-dimensional, unbalanced data is an urgent problem in microarray-based analysis and diagnosis of cancer. Therefore, according to the characteristics and existing problems of cancer gene expression data, this thesis studies the classification of unbalanced cancer gene expression datasets from the perspective of feature selection and gene subset evaluation. The main work of this thesis is as follows:

(1) To overcome the shortcomings of the binary imbalanced feature selection algorithm ARCO (AUC and rank correlation coefficient optimization) and the multi-class imbalanced feature selection algorithms MAUCD (using MAUC as the relevance metric to rank features directly) and MDFS (MAUC decomposition based feature selection method), we propose a revised Pearson correlation coefficient to assess the correlation between features and unify the ranges of the correlation and redundancy measures, which yields the APCO (AUC and improved Pearson correlation coefficient optimization) algorithm. For multi-class problems, we measure the redundancy of features with the same revised Pearson coefficient and obtain the MAUCP (MAUC and improved Pearson correlation coefficient optimization) and MDFSP (MDFS and improved Pearson correlation coefficient optimization) algorithms, based on the mRMR (maximal relevance-minimal redundancy) framework. This also overcomes the tendency of MDFS to converge to a locally optimal gene subset. SVM (support vector machine), NB (naive Bayes), and KNN (K-nearest neighbor) classifiers are adopted as the classification tools. Experimental results on seven two-class unbalanced gene datasets and three multi-class unbalanced gene datasets demonstrate that the proposed algorithms are superior to the original algorithms and also outperform other classic gene selection algorithms.

(2) To select efficient gene subsets from unbalanced gene datasets with high dimensionality and small sample sizes, we propose the F2_measure, the normalized mutual information SU, a normalized metric of feature weight, and dynamic-weight-based SFS (sequential forward search) and SFFS (sequential forward floating search) strategies. To speed up feature selection, a feature preselection method is proposed to reduce the size of the candidate feature subset. All of the proposed algorithms and innovations are tested on three popular unbalanced gene datasets. The experimental results demonstrate that the proposed gene subset selection algorithms can detect small gene subsets with very promising classification capability, and they also verify the correctness of the algorithms.

(3) Based on the relevance and redundancy between genes in the gene selection process, a new feature selection algorithm named FSDI (feature selection based on discernibility and independence of the feature) is proposed. The algorithm presents novel definitions of the discernibility and independence scores of a feature. The discernibility score measures the ability of a feature to distinguish instances from different classes. The independence score measures the redundancy of a feature. To quickly determine the selected gene subset, we construct a two-dimensional (2D) space with a feature's independence as the y-axis and its discernibility as the x-axis, and the area of the rectangle corresponding to a feature's discernibility and independence in the 2D space is used as the criterion to rank feature importance. We first cluster all of the genes using the K-means clustering algorithm and select typical genes from each cluster to form the preselected gene subset; the FSDI algorithm is then carried out on the preselected gene subset to obtain the optimal gene subset. Experimental results on 5 classical gene expression datasets demonstrate that the FSDI method can detect gene subsets with high efficiency, and that the SVM and KNN classifiers achieve better classification performance on the selected subsets.
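As an illustration of the idea behind contribution (1), the following is a minimal sketch of an mRMR-style greedy selection that uses AUC as the relevance measure and a Pearson coefficient as the redundancy measure for a two-class problem. The abstract does not define the revised Pearson coefficient or the exact APCO objective, so several parts here are assumptions made only for illustration: the plain absolute Pearson coefficient stands in for the revised one, relevance is rescaled to |2*AUC - 1| so that both quantities lie in [0, 1], and the "relevance minus mean redundancy" criterion and the function names `auc_relevance` and `greedy_select` are hypothetical.

```python
import numpy as np

def auc_relevance(x, y):
    """Relevance of one feature for binary labels y in {0, 1}, as |2*AUC - 1|.

    AUC is computed from ranks (Mann-Whitney form). Ties are not averaged
    in this sketch, so it assumes continuous-valued features.
    """
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return abs(2.0 * auc - 1.0)  # rescaled to [0, 1], like |Pearson r|

def greedy_select(X, y, k):
    """Pick k features, trading AUC relevance against Pearson redundancy."""
    n_features = X.shape[1]
    relevance = np.array([auc_relevance(X[:, j], y) for j in range(n_features)])
    # Assumption: plain |Pearson r| stands in for the thesis's revised coefficient.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start from the most relevant feature
    while len(selected) < k:
        candidates = [j for j in range(n_features) if j not in selected]
        # Assumed criterion: relevance minus mean redundancy with chosen features.
        scores = [relevance[j] - redundancy[j, selected].mean() for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

# Toy imbalanced data (160 vs 40): features 0 and 3 are informative,
# feature 1 nearly duplicates feature 0, feature 2 is noise.
rng = np.random.default_rng(0)
y = np.array([0] * 160 + [1] * 40)
X = rng.normal(size=(200, 4))
X[:, 0] += 3 * y
X[:, 3] += 3 * y
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)
print(greedy_select(X, y, 2))
```

On this toy data the near-duplicate pair (features 0 and 1) should contribute only one member to the selected subset, because the second member is penalised by its high correlation with the first.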
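The 2D-space ranking described in contribution (3) can be sketched as follows: each feature becomes a point (discernibility, independence), and features are ranked by the area of the rectangle spanned by that point and the origin, that is, by the product of the two scores. The abstract does not give the actual score definitions, so both scores below are assumed stand-ins: discernibility is a standardized difference of class means, and independence is one minus the absolute Pearson correlation to the closest more-discernible feature. The K-means preselection step is omitted, and all function names are hypothetical.

```python
import numpy as np

def discernibility(X, y):
    """Assumed stand-in: |difference of class means| / (sum of class std devs)."""
    a, b = X[y == 0], X[y == 1]
    spread = a.std(axis=0) + b.std(axis=0) + 1e-12  # avoid division by zero
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / spread

def independence(X, dis):
    """Assumed stand-in: 1 - |Pearson r| to the nearest more-discernible feature."""
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    ind = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        stronger = dis > dis[j]
        if stronger.any():
            ind[j] = dist[j, stronger].min()
        else:
            # The most discernible feature has no stronger neighbour,
            # so it receives its largest distance to any feature.
            ind[j] = dist[j].max()
    return ind

def rank_by_area(X, y, k):
    """Return indices of the k features with the largest rectangle area."""
    dis = discernibility(X, y)   # x-axis of the 2D space
    ind = independence(X, dis)   # y-axis of the 2D space
    area = dis * ind             # rectangle area used as the ranking criterion
    return np.argsort(-area)[:k]

# Toy imbalanced data (160 vs 40): features 0 and 3 are informative,
# feature 1 nearly duplicates feature 0, feature 2 is noise.
rng = np.random.default_rng(0)
y = np.array([0] * 160 + [1] * 40)
X = rng.normal(size=(200, 4))
X[:, 0] += 3 * y
X[:, 3] += 3 * y
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)
print(rank_by_area(X, y, 2))
```

A feature scores well only when both coordinates are large: the noise feature has high independence but low discernibility, and the weaker member of the near-duplicate pair has high discernibility but almost no independence, so both end up with a small area.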
Keywords/Search Tags: feature selection, gene selection, imbalanced gene data classification, 2D space, preliminary gene selection, gene subset