
Research on Feature Selection for Text Classification

Posted on: 2016-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wang
Full Text: PDF
GTID: 2348330470470424
Subject: Computer technology
Abstract/Summary:
Feature selection, as a preprocessing step in machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. The growth of high-throughput technologies has led to exponential growth in harvested data, in both dimensionality and sample size, and managing these data efficiently and effectively has become increasingly challenging; manual management of such datasets is now impractical. Data mining and machine learning techniques were therefore developed to automatically discover knowledge and recognize patterns in these data.

Existing feature selection algorithms can be divided by structure into three kinds: filter models, wrapper models, and embedded models. In a filter model, feature selection is separated from the classification model, which gives it advantages in speed and generality. However, filter-model feature selection suffers from two problems. First, its correlation analysis selects a locally optimal feature subset rather than a globally optimal one. Second, the selected features are evaluated only for relevance to the class; without an effective redundancy analysis among the features themselves, redundant features are not removed from the selected subset, which degrades the learning performance of the classification model. This thesis uses information-theoretic measures such as information gain and mutual information to evaluate candidate feature subsets, distinguishing the relevance between features and the class from the redundancy among features, and analyzing both. The goal is to select a feature subset with low redundancy and a large contribution to the class.
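The information-gain scoring mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the toy feature/label data and the assumption of discrete feature values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(C; F) = H(C) - H(C | F): how much knowing the (discrete)
    feature value reduces uncertainty about the class."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy example: a binary "term present" feature vs. document class.
feature = [1, 1, 0, 0, 1, 0]
labels = ["sport", "sport", "politics", "politics", "sport", "politics"]
print(information_gain(feature, labels))  # perfectly predictive -> 1.0
```

Features can then be ranked by this score, with high-gain features retained as candidates for the selected subset.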
On this basis, two feature selection algorithms are proposed: the first is a feature selection algorithm based on a competition-winning mechanism; the second is a new framework based on relevance and redundancy analysis. The main work of this thesis is as follows.

Because existing feature selection methods select features that are only locally characteristic of the dataset, we propose a feature selection method based on a competition-winning mechanism. Information gain is first computed for each feature of the dataset; each sample then acts as an individual in a competition, and features are selected as the winners of this competition, which markedly improves classification performance.

Feature selection is applied to reduce the number of features in many applications where the data have hundreds or thousands of features. Existing feature selection methods mainly focus on finding relevant features. In this thesis, we show that feature relevance alone is insufficient for efficient feature selection on high-dimensional data. We define feature redundancy and propose to perform explicit redundancy analysis during feature selection. A new framework is introduced that decouples relevance analysis from redundancy analysis. We develop a correlation-based method for relevance and redundancy analysis and conduct an empirical study of its efficiency and effectiveness in comparison with representative methods.
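A decoupled relevance/redundancy filter in the spirit described above could look like the sketch below. The symmetrical-uncertainty correlation measure, the `delta` relevance threshold, and the redundancy test (drop a feature if an already-kept feature correlates with it more strongly than it correlates with the class) are illustrative assumptions; the thesis's exact procedure may differ.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) over the empirical joint distribution."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(xs, ys) / (hx + hy)

def select_features(columns, labels, delta=0.1):
    """Two-stage filter: (1) relevance - keep features whose SU with the
    class exceeds delta; (2) redundancy - drop a feature if some
    already-selected feature is more correlated with it than it is
    with the class."""
    ranked = sorted(((symmetrical_uncertainty(col, labels), i)
                     for i, col in enumerate(columns)),
                    key=lambda t: (-t[0], t[1]))
    selected = []
    for su_class, i in ranked:
        if su_class <= delta:      # remaining features are irrelevant
            break
        if all(symmetrical_uncertainty(columns[j], columns[i]) < su_class
               for j in selected):  # not redundant with any kept feature
            selected.append(i)
    return selected

labels = [0, 0, 1, 1, 0, 1]
cols = [
    [0, 0, 1, 1, 0, 1],  # perfectly relevant
    [0, 0, 1, 1, 0, 1],  # exact duplicate: redundant, should be dropped
    [0, 1, 0, 1, 0, 1],  # weakly related: below the relevance threshold
]
print(select_features(cols, labels))  # -> [0]
```

The key design point is the decoupling: relevance is judged only against the class, while redundancy is judged only among features, so each analysis stays simple and the combined filter removes both irrelevant and redundant features.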
Keywords/Search Tags: machine learning, feature selection, information gain, mutual information, tournament winners