
The Research On RLS-MARS Feature Selection For Text Classification

Posted on: 2009-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2178360272980869
Subject: Computer system architecture
Abstract/Summary:
As the volume of information available on the Internet and on corporate intranets continues to grow, there is an increasing need for tools that help people organize, store, and access these resources. Text Classification (TC), one of the main such tools, is a key technology for organizing and processing large amounts of document data: it automatically sorts a set of documents into categories from a predefined set. It alleviates the problem of information disorder to a great extent and makes it convenient for users to find the required information quickly.

The high dimensionality of the feature space is one of the major problems in text classification. The basic terms of a feature space are words and phrases, so even a moderate-size document collection may yield a feature space with tens of thousands of terms. Many standard classification techniques cannot handle such a large feature set: a large number of features leads to the curse of dimensionality, overfitting, and degraded classifier performance. It is therefore necessary to reduce the original text representation space, provided classification accuracy is not compromised. Feature Selection and Feature Extraction help remove noisy features and reduce the dimensionality of text data sets. Feature Extraction projects the original feature space onto a lower-dimensional space and creates new (extracted) features, which are often linear or non-linear combinations of the original features; it helps with problems related to synonymy and polysemy, but it is difficult to give the new features a direct semantic interpretation. Feature Selection scores and ranks each original feature with an evaluation function and then retains the highest-ranked terms. Its main purpose is to select from the original feature space a lower-dimensional subset that still represents the original space.

This thesis presents an efficient feature selection method, Regularized Least Squares - Multi-Angle Regression and Shrinkage (RLS-MARS), which combines Efron's Least Angle Regression (LARS) method with the Regularized Least Squares (RLS) method. The method orders and selects features in a multi-dimensional feature space according to the directions along which the feature gradient matrix changes and the directions along which its modulus values decrease. RLS-MARS takes the relationships among latent variables into account and selects more distinct and effective features from the original feature set; the selected features preserve the distribution of the original set while having little or no correlation with one another.

The RLS-MARS feature selection technique determines the kernel features of a multi-dimensional space based on the features' properties. It consists of the following steps: (1) compute the relative minimum angles among the feature vectors; (2) determine the current direction of gradient decrease and re-compute the gradient values of the vectors; and (3) select the fitting variable of the estimation function used to choose the rational features.
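The abstract does not spell out the full RLS-MARS algorithm, but the LARS-plus-regularized-least-squares idea can be roughly illustrated with scikit-learn: lars_path ranks terms by the order in which the least-angle path brings them into the model, and a ridge (L2-regularized least squares) classifier is then trained on the retained terms. This is a minimal sketch under simplifying assumptions; the toy corpus, the cut-off k, and the alpha value are placeholders, not the thesis's method or experimental setup.

```python
# Rough sketch of a LARS-then-ridge pipeline (an approximation of the
# RLS-MARS idea described above, NOT the thesis's exact algorithm).
# The toy corpus, the cut-off k, and alpha are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import lars_path, RidgeClassifier

def lars_feature_order(X, y):
    """Return feature indices in the order the least-angle path adds them."""
    # lars_path moves along the equiangular direction, adding at each step
    # the feature most correlated with the current residual.
    _, active, _ = lars_path(X, y, method="lar")
    return list(active)

# --- illustrative usage on placeholder data ---
docs = ["stock market rises", "team wins the match",
        "market falls on earnings", "coach praises the team"]
labels = np.array([0, 1, 0, 1])           # 0 = finance, 1 = sports

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()     # lars_path needs a dense matrix

order = lars_feature_order(X, labels.astype(float))
k = 4                                     # assumed target dimensionality
selected = order[:k]
print("selected terms:", [vec.get_feature_names_out()[i] for i in selected])

# Regularized least squares (ridge) classifier on the reduced feature set;
# its L2 penalty plays the role of the regularizer discussed in the abstract.
clf = RidgeClassifier(alpha=1.0).fit(X[:, selected], labels)
print("training accuracy:", clf.score(X[:, selected], labels))
```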
Experiments on the Reuters-21578 corpus compare the estimated F1 values obtained in each class with and without the L2 regularizer. They show that keeping the L2 regularizer gives better results than leaving it out, under both skewed and even class distributions, as the dimensionality increases. The proposed method captures the semantic information of the categories and performs more effectively than the χ² statistic on several classes.
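For reference, the following is a minimal sketch of a chi-square feature-selection baseline of the kind compared against here, built with scikit-learn's chi2 scorer; the corpus, k, and classifier are illustrative placeholders, not the thesis's actual experimental configuration.

```python
# Rough sketch of a chi-square feature-selection baseline for text
# classification; corpus, k, and classifier are placeholders, not the
# thesis's experimental configuration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier

docs = ["stock market rises", "team wins the match",
        "market falls on earnings", "coach praises the team"]
labels = np.array([0, 1, 0, 1])

X = TfidfVectorizer().fit_transform(docs)         # chi2 needs non-negative input
selector = SelectKBest(chi2, k=4).fit(X, labels)  # score terms by chi-square
X_reduced = selector.transform(X)

clf = RidgeClassifier(alpha=1.0).fit(X_reduced, labels)
print("training accuracy:", clf.score(X_reduced, labels))
```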
Keywords/Search Tags: Text Classification, Feature Selection, RLS, LARS, RLS-MARS