
Research On Two-Stage Feature Selection Methods In Machine Learning

Posted on: 2016-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: G M Liu
Full Text: PDF
GTID: 2348330470969457
Subject: Computer application technology
Abstract/Summary:
With the advent of the era of big data, how to process data quickly and extract useful information from it has become an urgent problem. As a preprocessing step in machine learning and data mining, feature selection has become a research hotspot, since machine learning algorithms themselves are no longer the bottleneck in big data processing. In recent years, many studies have shown that irrelevant and redundant features greatly harm the accuracy and efficiency of machine learning algorithms. It is therefore necessary to choose an appropriate feature selection algorithm that picks effective features out of massive amounts of data, so that machine learning algorithms can be served efficiently.

This thesis studies feature selection methods in machine learning. Our aim is to choose the most effective features from a high-dimensional feature space, in order to improve the efficiency of algorithms and reduce their running time. The main content is divided into the following parts.

Firstly, starting from the classification of feature selection methods, and based on the relationship between feature selection and the learning algorithm, feature selection methods can be divided into the filter model and the wrapper model. The filter model is efficient and widely applicable, and can detect and delete irrelevant features; the wrapper model achieves high accuracy and can form an optimal feature subset free of redundant features. Combining the two models, we propose a two-stage feature selection method.

Secondly, for high-dimensional binary data whose features take only the values 0 and 1, we define the diff-criterion to measure the relationship between features.
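The abstract does not give the formal definition of the diff-criterion, so the following is only a hypothetical stand-in to illustrate the idea of scoring a 0/1 feature against binary class labels: it measures how differently the feature is distributed across the two classes. The function name `diff_score` and the toy data are our own, not taken from the thesis.

```python
def diff_score(feature, labels):
    """Toy relevance score for a binary (0/1) feature against binary labels.

    Hypothetical stand-in for the thesis's diff-criterion: score the
    feature by the absolute difference between its frequency of 1s in
    the positive class and in the negative class. A feature distributed
    identically across the classes scores 0; one that perfectly tracks
    the labels scores 1.
    """
    pos = [f for f, y in zip(feature, labels) if y == 1]
    neg = [f for f, y in zip(feature, labels) if y == 0]
    freq_pos = sum(pos) / len(pos) if pos else 0.0
    freq_neg = sum(neg) / len(neg) if neg else 0.0
    return abs(freq_pos - freq_neg)

labels  = [1, 1, 1, 0, 0, 0]
f_good  = [1, 1, 1, 0, 0, 0]   # tracks the label exactly
f_const = [1, 1, 1, 1, 1, 1]   # constant, carries no class information
print(diff_score(f_good, labels))   # 1.0
print(diff_score(f_const, labels))  # 0.0
```

Because the score needs only two class-conditional frequencies per feature, it can be computed in a single pass over the data, which is consistent with the abstract's claim of efficient correlation analysis on high-dimensional binary data.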
Compared with traditional methods, the diff-criterion greatly improves the efficiency of correlation analysis.

Thirdly, to detect redundant features from the perspective of correlation analysis, we propose a non-linear correlation analysis based on the maximum information coefficient (MIC). This method computes the degree of non-linear correlation between features, which further reduces the dimensionality of the feature subset.

Finally, based on the maximum-relevance minimum-redundancy principle, we propose two feature selection methods. One targets binary data and combines the diff-criterion with Markov blankets; the other is based on symmetrical uncertainty and the maximum information coefficient, in order to obtain the optimal feature subset.
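The combination of symmetrical uncertainty with maximum-relevance minimum-redundancy selection can be illustrated with a minimal greedy sketch over discrete features. This is not the thesis's algorithm: it uses symmetrical uncertainty as the only correlation measure (the MIC computation for continuous non-linear dependence is omitted), and the function names and toy data are our own assumptions.

```python
from math import log2
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    h = entropy(xs) + entropy(ys)
    return 2 * mutual_info(xs, ys) / h if h else 0.0

def mrmr_select(features, labels, k):
    """Greedy max-relevance min-redundancy selection of k feature indices.

    At each step, pick the feature maximising SU with the labels
    (relevance) minus its mean SU with already-selected features
    (redundancy).
    """
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(i):
            rel = symmetrical_uncertainty(features[i], labels)
            red = (sum(symmetrical_uncertainty(features[i], features[j])
                       for j in selected) / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: f1 duplicates f0, so picking both would be pure redundancy.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 0, 0, 0, 1, 1, 1]   # strongly relevant
f1 = [0, 0, 0, 0, 0, 1, 1, 1]   # exact copy of f0 (redundant)
f2 = [0, 0, 0, 1, 1, 0, 1, 1]   # weakly relevant, nearly independent of f0
print(mrmr_select([f0, f1, f2], labels, 2))  # [0, 2]
```

The redundancy penalty is what makes the selection skip the duplicate `f1` in favour of the weaker but complementary `f2`; a pure relevance ranking would have chosen the two copies.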
Keywords/Search Tags: feature selection, machine learning, diff-criterion, maximum information coefficient