Research On Feature Selection And Feature Subset Redundancy For Gene Expression Data

Posted on: 2020-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Du
GTID: 2480306305997839
Subject: Statistics
Abstract/Summary:
Genes are DNA fragments that carry hereditary information and underpin the basic structure and function of living organisms. Studies have found that many diseases are caused by gene mutations, so disease diagnosis based on gene expression data has become an important subject in biomedicine. Gene expression data are characterized by high dimensionality and small sample size, so feature selection is needed during data preprocessing. Feature selection not only reduces the data dimension and the subsequent workload, but also helps identify important features and reduces the impact of noise in the data set. This thesis therefore investigates feature selection and feature subset redundancy for gene expression data. The main work is as follows:

(1) A single feature selection method often has its own limitations, so for the same disease classification model the classification results depend on which selection method is used. To overcome this limitation, the thesis adopts an ensemble learning approach: several single feature selection methods are applied to the same data set and their results are aggregated. Analysis of multiple commonly used binary and multi-class tumor gene expression data sets shows that the feature subset selected by the ensemble feature selection method gives better classification and prediction performance.

(2) To address redundancy within the selected feature subset, a heuristic feature subset redundancy removal algorithm is proposed. Its main idea is to retain the most important feature genes and to remove the feature genes that are strongly correlated with the feature subset. This method does not change the original attributes of the features and can be regarded as a second round of feature selection. The experimental results show that, compared with the original feature subset, the feature set after redundancy removal can often achieve higher classification accuracy, but the classification performance depends on the redundancy removal threshold.

(3) A second feature subset redundancy removal algorithm, based on principal component analysis, is also proposed. This method uses principal component analysis to eliminate the correlation between features and constructs a new feature set from the original feature subset. It is a feature extraction method, which changes the original feature space. The experimental results show that, for most data sets, the newly constructed feature set achieves higher classification accuracy than the original feature subset.

Compared with traditional feature selection methods, both the proposed ensemble feature selection method and the feature subset redundancy removal methods show better classification performance. However, feature selection and redundancy removal are both dimensionality reduction processes, which cause some loss of the original information. For a specific data set, how to determine the size of the feature subset while minimizing the loss of information is a problem that needs further study.
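The ensemble idea in item (1) can be illustrated with a minimal sketch. The abstract does not name the individual selection methods or the aggregation rule, so everything here is an assumption: three simple filter scores (a Welch-style t statistic, a signal-to-noise ratio, and a rank-based AUC score) are combined by averaging their ranks, and the example covers only the binary case with labels 0 and 1. All function names are hypothetical.

```python
import numpy as np


def rank_desc(scores):
    """Return ranks (0 = best) for scores where larger means more important."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks


def t_score(X, y):
    """Absolute Welch-style t statistic between classes 0 and 1, per feature."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)


def snr_score(X, y):
    """Signal-to-noise ratio: |mean difference| / (sum of class std devs)."""
    a, b = X[y == 0], X[y == 1]
    spread = a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (spread + 1e-12)


def auc_score(X, y):
    """|AUC - 0.5| from per-feature ranks (Mann-Whitney U, assumes no ties)."""
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # 1-based ranks per column
    n1 = int((y == 1).sum())
    n0 = len(y) - n1
    u = ranks[y == 1].sum(axis=0) - n1 * (n1 + 1) / 2
    return np.abs(u / (n0 * n1) - 0.5)


def ensemble_select(X, y, k, scorers=(t_score, snr_score, auc_score)):
    """Average the per-method ranks and return the indices of the k best features."""
    mean_rank = np.mean([rank_desc(s(X, y)) for s in scorers], axis=0)
    return np.argsort(mean_rank, kind="stable")[:k]


# Synthetic data: 40 samples, 200 "genes", only genes 0-4 carry class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[:, :5] += 3.0 * y[:, None]
selected = ensemble_select(X, y, k=5)
```

Rank averaging is only one possible aggregation rule; voting on top-k membership or weighting methods by validation accuracy would fit the same description in the abstract.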
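The heuristic redundancy removal in item (2) might look roughly like the greedy pass below. This is a sketch under assumptions, not the thesis algorithm: the correlation measure (absolute Pearson correlation), the rule "drop a feature if it correlates with any kept feature above the threshold", the default threshold of 0.8, and the function name `remove_redundant` are all illustrative choices.

```python
import numpy as np


def remove_redundant(X, ranked, threshold=0.8):
    """Greedy pass over features in decreasing importance.

    X         : (n_samples, n_features) expression matrix
    ranked    : feature indices, most important first
    threshold : a feature is dropped if its absolute Pearson correlation
                with any already-kept feature exceeds this value
    """
    corr = np.abs(np.corrcoef(X[:, ranked], rowvar=False))
    kept = []  # positions within `ranked`
    for pos in range(len(ranked)):
        if all(corr[pos, q] <= threshold for q in kept):
            kept.append(pos)
    return [int(ranked[p]) for p in kept]


rng = np.random.default_rng(1)
base = rng.normal(size=(50, 3))
# Column 3 is a near copy of column 0, so it is redundant with it.
X = np.column_stack([base, base[:, 0] + 0.05 * rng.normal(size=50)])
ranked = np.array([0, 3, 1, 2])  # hypothetical importance order
kept = remove_redundant(X, ranked, threshold=0.8)
```

Because the kept features are original genes, the result stays interpretable. The threshold controls how aggressively features are dropped, which matches the abstract's remark that classification performance depends on it.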
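For the PCA-based approach in item (3), a minimal sketch is to center the selected feature subset and project it onto its leading principal components, which are mutually uncorrelated. The abstract does not say how many components are kept, so the cumulative explained-variance cutoff `var_ratio` and the function name `pca_transform` are assumptions made for illustration.

```python
import numpy as np


def pca_transform(X, var_ratio=0.95):
    """Project X onto the leading principal components.

    Keeps the smallest number of components whose cumulative explained
    variance reaches `var_ratio` (a hypothetical default, not from the thesis).
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)          # variance share per component
    k = int(np.searchsorted(np.cumsum(explained), var_ratio) + 1)
    k = min(k, len(s))
    return Xc @ Vt[:k].T, Vt[:k]


rng = np.random.default_rng(2)
latent = rng.normal(size=(30, 2))
# Six correlated "genes" driven by two latent factors plus small noise.
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(30, 6))
Z, components = pca_transform(X, var_ratio=0.95)
```

Each column of `Z` is a linear combination of the original genes, so the new features no longer correspond to individual genes. This is the sense in which the abstract calls the method feature extraction that changes the original feature space.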
Keywords/Search Tags:Gene expression data, Feature selection, Disease classification, Feature subset redundancy, Principal component analysis