Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets

Posted on: 2017-01-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D J Yao
GTID: 1318330518472881
Subject: Computer application technology
Abstract/Summary:
Data mining in medicine is an important research direction within data mining and has been a research hotspot in both computer science and medical research for many years. In recent years, the focus of medical data mining has been shifting from clinical data to gene chip data, and many data mining algorithms, such as Decision Tree, Support Vector Machine and Artificial Neural Network, have been applied to a variety of medical research problems. However, it is difficult to apply data mining techniques directly to medical data analysis tasks, which are characterized by high-dimensional feature spaces, complex interactions among multiple features, imbalanced example classes, and the requirement in medicine and the life sciences that data mining results be transparent and intelligible.

The random forest algorithm is an ensemble machine learning algorithm based on decision trees. On the one hand, it has been widely used in medical data analysis because of its high classification accuracy, fast computation, and capacity to identify the main relevant features in datasets with small marginal effects and complex interactions among features. On the other hand, studies show that the classification ability and stability of random forest are weakened on imbalanced and high-dimensional datasets. Aiming at the characteristics of medical datasets, such as high dimensionality, redundancy, correlation and class imbalance, this work studies feature selection and data classification methods based on random forest, using UCI standard datasets, a diabetes clinical dataset and gene chip datasets. It focuses mainly on the following aspects.

Firstly, a new random forest classification method based on the Bootstrap data sampling technique is proposed to improve the classification performance of the random forest classifier on medical datasets with imbalanced sample classes. The algorithm first constructs class-balanced training datasets from the original training set by random resampling with replacement, then trains a random forest classifier on each resampled training dataset, and finally determines the class of an unknown sample by majority voting. Simulation experiments on UCI standard datasets show that, compared with algorithms based on random under-sampling and on cost-sensitive learning, the proposed algorithm effectively improves precision while maintaining higher recall on class-imbalanced datasets.

Secondly, a new filter feature selection algorithm based on random forest is proposed to address the high-dimensional feature space and high correlation among features in medical clinical datasets. The method is based on the variable importance scores of random forest, and the feature selection threshold is determined through iterative experiments. The algorithm then selects the top k features with the largest importance scores to form the feature subset, and finally trains classifiers on the selected feature subset. Simulation experiments on UCI standard datasets and independently collected diabetes clinical datasets show that the classification accuracy of the proposed algorithm is clearly higher than that of algorithms based on Discernibility of Feature Subsets and Correlation-based Feature Selector.

Thirdly, aiming at the problems of complex interactions among multiple features and high feature redundancy in medical datasets, a new wrapper feature selection algorithm based on random forest is proposed. The method selects the optimal feature subset from the original feature space based on the variable importance scores of the random forest algorithm, exploiting the random forest's ability to recognize the main relevant features in datasets with small marginal effects and complex interactions among multiple features. It combines the sequential backward selection (SBS) and sequential forward selection (SFS) search strategies and uses the classification accuracy of a classifier on each feature subset as the evaluation function. The feature subset with the highest classification accuracy is taken as the optimal feature subset. Simulation experiments on UCI standard datasets and a real clinical diabetes dataset show that the proposed algorithm effectively improves the quality of the selected feature subset and the classification accuracy of the SVM classifier, and that it outperforms existing filter-based algorithms and algorithms based on other criteria.

Finally, aiming at the large amount of noisy data, redundant genes and irrelevant genes in microarray expression datasets, a new feature selection method combining the wrapper and filter approaches is proposed. The proposed algorithm first filters out clearly unrelated noise genes using a statistics-based filter feature selection algorithm, and then selects the optimal feature subset using a wrapper feature selection method based on random forest. Within the wrapper stage, a feature search strategy based on the variable importance measure of random forest is proposed; it combines sequential forward search with sequential backward search and iteratively eliminates unimportant and redundant features in a stratified feature space. Simulation experiments on microarray expression datasets show that the proposed algorithm is superior to existing algorithms in classification accuracy.
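
The class-balanced resampling and majority-voting steps summarized in the abstract can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: the function names are hypothetical, the choice to resample every class up to the size of the largest class is an assumption (the abstract only says the resamples are class-balanced and drawn with replacement), and the base learner is left as a parameter. The dissertation trains a random forest on each resample; a nearest-centroid stand-in is used here only to keep the example dependency-free.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(X, y, rng):
    """Draw one class-balanced resample with replacement.

    Assumption: every class is resampled to the size of the largest class.
    """
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for _ in range(target):
            Xb.append(rng.choice(rows))  # sampling with replacement
            yb.append(label)
    return Xb, yb


def fit_ensemble(X, y, fit_base, n_models=5, seed=0):
    """Train one base model per balanced resample.

    fit_base(X, y) must return a callable that maps a row to a label.
    """
    rng = random.Random(seed)
    return [fit_base(*balanced_bootstrap(X, y, rng)) for _ in range(n_models)]


def predict_majority(models, x):
    """Classify x by majority vote over the ensemble members."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def fit_centroid(X, y):
    """Stand-in base learner (NOT a random forest): nearest class centroid
    on a single numeric feature."""
    sums, counts = defaultdict(float), Counter()
    for (value,), label in zip(X, y):
        sums[label] += value
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(centroids, key=lambda label: abs(x[0] - centroids[label]))


# Toy imbalanced data: six samples of class 0, two of class 1.
X_toy = [(0.0,), (0.5,), (1.0,), (1.5,), (0.2,), (0.8,), (10.0,), (9.0,)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]
toy_models = fit_ensemble(X_toy, y_toy, fit_centroid, n_models=5, seed=0)
```

Calling `predict_majority(toy_models, (9.5,))` then votes across the five members. Swapping `fit_centroid` for a function that fits a random forest would bring the sketch closer to the method the abstract describes.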
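
The filter step, which ranks features by importance score and keeps the top k with k chosen through repeated trials, might look like the sketch below. It assumes the importance scores have already been computed (for example, by a random forest) and are passed in as a dictionary. The `evaluate` callback, the feature names and the toy scoring rule are all hypothetical placeholders for training and scoring a classifier on each candidate subset.

```python
def rank_features(importances):
    """Return feature names sorted by descending importance score."""
    return sorted(importances, key=importances.get, reverse=True)


def select_top_k(importances, k):
    """Keep the k features with the largest importance scores."""
    return rank_features(importances)[:k]


def choose_k(importances, evaluate, candidates):
    """Try each candidate k and keep the best-scoring subset.

    evaluate(subset) is assumed to return a score such as classification
    accuracy. Ties go to the smaller k because only a strictly better
    score replaces the current best.
    """
    best_subset, best_score = None, float("-inf")
    for k in sorted(candidates):
        subset = select_top_k(importances, k)
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical importance scores and a toy evaluator that rewards
# informative features and penalizes noise features.
toy_importances = {"glucose": 0.40, "bmi": 0.25, "age": 0.20,
                   "noise_a": 0.10, "noise_b": 0.05}
informative = {"glucose", "bmi", "age"}


def toy_evaluate(subset):
    chosen = set(subset)
    return 0.6 + 0.1 * len(chosen & informative) - 0.02 * len(chosen - informative)


best_subset, best_score = choose_k(toy_importances, toy_evaluate, range(1, 6))
```

With this toy evaluator the search settles on the three informative features. In practice `evaluate` would be a cross-validated accuracy estimate, which is considerably more expensive per candidate k.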
Keywords/Search Tags: data mining in medicine, feature selection, micro-array expression data analysis, random forest, support vector machine