
Research on Feature Selection for Text Classification

Posted on: 2016-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wang
Full Text: PDF
GTID: 2348330470470424
Subject: Computer technology
Abstract/Summary:
Feature selection, as a preprocessing step in machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. The growth of high-throughput technologies has led to exponential growth in harvested data, in both dimensionality and sample size, and managing these data efficiently and effectively has become increasingly challenging; manual management of such datasets is now impractical. Data mining and machine learning techniques were therefore developed to automatically discover knowledge and recognize patterns in these data.

Existing feature selection algorithms can be divided by structure into three kinds: filter models, wrapper models, and embedded models. In a filter model, feature selection is separated from the classification model, which gives it advantages in speed and generality. However, filter-model feature selection suffers from two problems. First, its correlation analysis selects a locally optimal feature subset rather than a globally optimal one. Second, the selected features are evaluated only for relevance to the class; without an effective redundancy analysis among the features themselves, redundant features are not removed from the selected subset, which degrades the learning performance of the classification model. This thesis uses information-theoretic measures such as information gain and mutual information to evaluate candidate feature subsets, distinguishing the relevance between features and the class from the redundancy among features, and analyzing both. The goal is to select a feature subset with low redundancy and a large contribution to the class.
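The information-gain scoring mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the toy feature/label data and the assumption of discrete feature values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(C; F) = H(C) - H(C | F): how much knowing the (discrete)
    feature value reduces uncertainty about the class."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy example: a binary "term present" feature vs. document class.
feature = [1, 1, 0, 0, 1, 0]
labels = ["sport", "sport", "politics", "politics", "sport", "politics"]
print(information_gain(feature, labels))  # perfectly predictive -> 1.0
```

Features can then be ranked by this score, with high-gain features retained as candidates for the selected subset.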
On this basis, two feature selection algorithms are proposed: the first is a feature selection algorithm based on a competition-winning mechanism; the second is a new framework based on relevance and redundancy analysis. The main work of this thesis is as follows.

Because existing feature selection methods select features that are only locally characteristic of the dataset, we propose a feature selection method based on a competition-winning mechanism. Information gain is first computed for each feature of the dataset; each sample then acts as an individual in a competition, and features are selected as the winners of this competition, which markedly improves classification performance.

Feature selection is applied to reduce the number of features in many applications where the data have hundreds or thousands of features. Existing feature selection methods mainly focus on finding relevant features. In this thesis, we show that feature relevance alone is insufficient for efficient feature selection on high-dimensional data. We define feature redundancy and propose to perform explicit redundancy analysis during feature selection. A new framework is introduced that decouples relevance analysis from redundancy analysis. We develop a correlation-based method for relevance and redundancy analysis and conduct an empirical study of its efficiency and effectiveness in comparison with representative methods.
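A decoupled relevance/redundancy filter in the spirit described above could look like the sketch below. The symmetrical-uncertainty correlation measure, the `delta` relevance threshold, and the redundancy test (drop a feature if an already-kept feature correlates with it more strongly than it correlates with the class) are illustrative assumptions; the thesis's exact procedure may differ.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) over the empirical joint distribution."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(xs, ys) / (hx + hy)

def select_features(columns, labels, delta=0.1):
    """Two-stage filter: (1) relevance - keep features whose SU with the
    class exceeds delta; (2) redundancy - drop a feature if some
    already-selected feature is more correlated with it than it is
    with the class."""
    ranked = sorted(((symmetrical_uncertainty(col, labels), i)
                     for i, col in enumerate(columns)),
                    key=lambda t: (-t[0], t[1]))
    selected = []
    for su_class, i in ranked:
        if su_class <= delta:      # remaining features are irrelevant
            break
        if all(symmetrical_uncertainty(columns[j], columns[i]) < su_class
               for j in selected):  # not redundant with any kept feature
            selected.append(i)
    return selected

labels = [0, 0, 1, 1, 0, 1]
cols = [
    [0, 0, 1, 1, 0, 1],  # perfectly relevant
    [0, 0, 1, 1, 0, 1],  # exact duplicate: redundant, should be dropped
    [0, 1, 0, 1, 0, 1],  # weakly related: below the relevance threshold
]
print(select_features(cols, labels))  # -> [0]
```

The key design point is the decoupling: relevance is judged only against the class, while redundancy is judged only among features, so each analysis stays simple and the combined filter removes both irrelevant and redundant features.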
Keywords/Search Tags: machine learning, feature selection, information gain, mutual information, tournament winners