Random Forest Feature Selection

Posted on: 2012-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Wang
GTID: 2218330368987826
Subject: Computer application technology
Abstract/Summary:
As data sets grow rapidly in scale, filtering out the useful information has become a major problem. Data mining technology was developed to help analyze complex data sets, and as machine learning algorithms have matured, more and more of them have been applied in the data mining field. By filtering noisy data and selecting feature subsets, these methods can extract valuable information from large data sets effectively and prepare it for subsequent analysis and research.

Random forest (RF) is an excellent machine learning method that has been used successfully in many fields, and as it has been developed and improved it has recently attracted growing attention for feature subset selection. This thesis studies random forest in depth as applied to the analysis of metabolomics data. To reduce the effect of noise features on classification accuracy, artificial contrast variables are used to identify and delete noise features, which improves the classification accuracy of random forest. When random forest is used to select a feature subset, it measures the importance of each feature. Scoring the features only once is not enough, however, because the scores are affected by many factors. Based on the characteristics of the experimental data set, the thesis therefore proposes a restricted random forest with recursive feature elimination (RF-RFE). The result of feature subset selection depends on both the design of the processing strategy and the design of the model construction strategy. To combine the advantages of each method and obtain a comprehensive understanding of the object under study, the thesis also proposes a new ensemble strategy that includes random forest, support vector machine, and genetic algorithm.

This thesis focuses on using RF to select feature subsets in metabolomics. As one of the four main branches of systems biology, metabolomics can reveal what has changed in the body, which is meaningful for disease diagnosis and treatment.
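The artificial contrast variable idea mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the number of rounds, and the "beat the best contrast in more than half of the rounds" rule are all assumptions made for illustration. Each contrast variable is a shuffled copy of a real feature, so it keeps the same value distribution but carries no information about the class label; a real feature whose importance does not exceed that of the contrasts is treated as noise.

```python
import random


def rf_importance(X, y):
    """Impurity-based importances from a scikit-learn random forest."""
    # Imported here so the selection logic below also works with other scorers.
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    return list(forest.feature_importances_)


def add_contrast_columns(X, rng):
    """Return X with one shuffled copy of every column appended.

    Shuffling a column keeps its values but breaks any link to the labels.
    """
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    contrasts = []
    for column in columns:
        shuffled = column[:]
        rng.shuffle(shuffled)
        contrasts.append(shuffled)
    return [list(values) for values in zip(*(columns + contrasts))]


def select_above_contrast(X, y, importance_fn=rf_importance, n_rounds=10, seed=0):
    """Keep features that outscore every contrast in most rounds (assumed rule)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    wins = [0] * n_features
    for _ in range(n_rounds):
        scores = importance_fn(add_contrast_columns(X, rng), y)
        # The first n_features scores belong to real features, the rest to contrasts.
        threshold = max(scores[n_features:])
        for j in range(n_features):
            if scores[j] > threshold:
                wins[j] += 1
    return [j for j in range(n_features) if wins[j] > n_rounds / 2]
```

Calling `select_above_contrast(X, y)` with a list of rows `X` and a list of labels `y` returns the column indices that survive. The scorer is passed in as `importance_fn` so that a different importance measure (for example permutation importance) can be substituted without changing the selection logic.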
Using artificial contrast variables to delete noise features improves the classification accuracy of random forest from 90.7% to 94.4%; using RF-RFE to select features yields 18 important, identified features with satisfactory classification accuracy; and using the ensemble strategy to select features yields 31 important features with a classification accuracy of 100%.
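As a rough illustration of recursive feature elimination driven by random forest importances, the loop below re-scores the surviving features after every elimination step, which is the point of not relying on a single scoring pass. This is a generic sketch and does not reproduce the "restricted" variant proposed in the thesis; the function names, `n_keep`, and `drop_fraction` are hypothetical parameters chosen for the example.

```python
def rf_importance(X, y):
    """Impurity-based importances from a scikit-learn random forest."""
    # Imported here so the elimination loop also works with other scorers.
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    return list(forest.feature_importances_)


def recursive_feature_elimination(X, y, n_keep, importance_fn=rf_importance,
                                  drop_fraction=0.2):
    """Repeatedly drop the least important features until n_keep remain.

    Returns the original column indices of the surviving features.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        # Re-fit on the surviving columns so importances reflect the current subset.
        X_sub = [[row[j] for j in remaining] for row in X]
        scores = importance_fn(X_sub, y)
        # Drop a fraction per step, at least one, and never go below n_keep.
        n_drop = max(1, int(len(remaining) * drop_fraction))
        n_drop = min(n_drop, len(remaining) - n_keep)
        ranked = sorted(range(len(remaining)), key=lambda k: scores[k])
        to_drop = set(ranked[:n_drop])
        remaining = [f for k, f in enumerate(remaining) if k not in to_drop]
    return remaining
```

For example, `recursive_feature_elimination(X, y, n_keep=18)` would return 18 column indices. In practice the stopping point is usually chosen by tracking cross-validated accuracy at each step rather than by fixing the count in advance.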
Keywords/Search Tags:Random Forest, Feature Selection, Metabolomics, Data Mining, Machine Learning