
Theoretical Analysis And Algorithm Study On Feature Selection For Text Categorization

Posted on: 2008-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: X C Xiong
Full Text: PDF
GTID: 2178360242494009
Subject: Computer Science and Technology
Abstract/Summary:
In the past decades, with the growth of the internet, the quantity of freely retrievable data has increased rapidly, and most of it is text. Text Classification, also known as Text Categorization, has therefore become an important research area in data mining and text mining, and many algorithms and results have appeared in the literature. However, as the internet and application requirements change rapidly, many text classification and text processing methods run into difficulties: large-scale data, too many features, excessive computing time, noisy data, poor learning performance, and so on. The feature is the basic unit of processing and analysis in data mining, and in theory computational complexity grows exponentially with the number of features. Dimension reduction has thus become an important research problem in recent years.

Dimension reduction helps to improve the processing capability of a learning algorithm, speed up the learning procedure, reduce storage space, and make the learned model easier to understand. By extracting the most important features, feature selection can also help users understand the data and the relationships within it. There are two categories of dimension reduction methods: feature extraction and feature selection. Feature extraction constructs a set of new features from the original ones; the new features may represent the characteristics of the data better, be easier to learn from, and yield better learning performance. Its drawback is a much higher computational cost: for example, Principal Component Analysis (PCA) takes O(n³) time. Feature extraction is therefore appropriate for data or applications with a small number of features, but becomes unacceptable when the number of features is large. The other category of dimension reduction methods is feature selection.
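The O(n³) cost mentioned above comes from the eigendecomposition of the feature covariance matrix at the heart of PCA. A minimal sketch (not the thesis's method, just an illustration of why the cost scales with the cube of the feature count d):

```python
import numpy as np

def pca_extract(X, k):
    """Project samples onto the top-k principal components.

    The eigendecomposition of the d x d covariance matrix costs
    O(d^3), which is why feature extraction becomes impractical
    when the number of features d is large, as in text data.
    """
    Xc = X - X.mean(axis=0)             # center the data
    cov = np.cov(Xc, rowvar=False)      # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)    # O(d^3) eigendecomposition
    order = np.argsort(vals)[::-1][:k]  # indices of the top-k eigenvalues
    return Xc @ vecs[:, order]          # new k-dimensional representation

# toy usage: 6 samples with 4 features reduced to 2 extracted features
X = np.random.rand(6, 4)
Z = pca_extract(X, 2)
print(Z.shape)  # (6, 2)
```

For a text corpus with tens of thousands of word features, the d × d covariance matrix alone is already prohibitive, which motivates the selection-based methods discussed next.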
Feature selection does not transform the original feature set; instead, it directly selects a subset of the original features as the new feature set. One advantage of feature selection methods is their low computational cost, e.g., O(n) for filter methods and O(n²) for wrapper methods. Feature selection is therefore better suited to large-scale datasets such as text datasets and micro-array datasets.

This thesis focuses on feature selection algorithms for text classification. After investigating classic text categorization methods and the prevalent feature selection and feature extraction methods, we propose a graph-based feature selection method. For each word feature, the method computes the stationary distribution of a Markov chain within each category, combines the per-category scores through a combination model into an overall score, ranks the features by this overall score, and selects the highest-scoring features as the selected feature subset. The model description shows that graph-based feature selection is well founded, and experiments show that the graph-based method is almost always better than chi-square (CHI), and better than the most prevalent feature selection method, Information Gain (IG), when the reduced dimension is relatively low, around 10-1000 features.
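The scoring pipeline described above can be sketched as follows. This is only an illustrative reading of the abstract: the word-affinity matrix per category and the weighted-sum combination model are assumptions here, since the exact graph construction and combination model are defined in the thesis body, not in this abstract.

```python
import numpy as np

def stationary_scores(A, tol=1e-10):
    """Stationary distribution of a Markov chain over word features.

    A is an assumed word-to-word affinity matrix for one category
    (e.g. built from co-occurrence counts, with positive entries).
    Rows are normalized into transition probabilities, and the
    stationary distribution is found by power iteration.
    """
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    pi = np.full(len(A), 1.0 / len(A))    # start from the uniform distribution
    while True:
        nxt = pi @ P                      # one step of the chain
        if np.abs(nxt - pi).sum() < tol:  # converged to the stationary vector
            return nxt
        pi = nxt

def select_features(graphs, weights, m):
    """Combine per-category stationary scores into an overall score
    (a weighted sum is one plausible combination model) and keep the
    indices of the top-m features."""
    total = sum(w * stationary_scores(A) for A, w in zip(graphs, weights))
    return np.argsort(total)[::-1][:m]

# toy usage: 3 word features, 2 categories, hypothetical affinity matrices
A1 = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [1.0, 1.0, 2.0]])
A2 = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 1.0]])
best = select_features([A1, A2], [0.5, 0.5], m=2)
print(best)  # indices of the 2 highest-scoring word features
```

Because each stationary distribution is a single fixed-point computation per category, the ranking step stays close to the O(n) cost range of filter methods, consistent with the large-scale text setting targeted above.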
Keywords/Search Tags: Feature Selection, Dimension Reduction, Text Categorization, Text Classification