
Research On Feature Selection Algorithms For Text Classification

Posted on: 2013-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: R J Yang
Full Text: PDF
GTID: 2268330395979606
Subject: Computer application technology
Abstract/Summary:
Text classification technology assigns the mass of unclassified information on the Internet to predefined categories, improving the utilization of information and helping researchers assess its value. Text classification is a process comprising six steps: corpus construction, text preprocessing, text representation, feature dimensionality reduction, classifier construction, and testing. Among these, feature dimensionality reduction is the most important: it plays a key role in improving classification accuracy, saving storage space, and reducing computational time complexity.

There are two main approaches to dimensionality reduction: feature extraction and feature selection. Feature extraction maps the original features onto a new feature subset; in this new space the features are more independent of one another, which makes different texts easier to distinguish. Feature selection, by contrast, uses a feature-weighting formula to score every feature in the original set and keeps the most discriminative ones as the subset used to train the classifier. Feature selection methods can be categorized from different perspectives; in particular, they are called "supervised" or "unsupervised" according to whether they make use of class labels.

This thesis proposes two algorithms. The first is the mRMR-ReliefF feature selection algorithm, which is based on the ReliefF algorithm. It substitutes probabilities into the feature difference measure to compensate for that measure's shortcomings and derives a new difference function; features selected by this function show stronger within-class correlation and larger between-class differences. The algorithm also takes word correlation into account: it not only selects words closely related to the characteristic ones but also eliminates redundant features. Comparative experiments among the three algorithms verify that the proposed algorithm provides more effective features for text classification.

The second is an improvement of the conventional IG algorithm: a text feature selection method based on information gain, named TDpIG. First, it samples the data set by category to reduce the influence of class imbalance on feature selection. Second, it computes the information gain using the probability of feature occurrence, which reduces the interference of low-frequency words. Finally, it analyzes the candidate features by the degree of dispersion of their probabilities within the information gain computation, filters out relatively redundant high-frequency words, and refines the differences in information gain among the selected features to obtain a more uniform and accurate feature subset. Contrast experiments indicate that the features it selects are better suited to text classification.

Both feature selection algorithms presented in this thesis are supervised. They refine existing feature selection methods to raise classification accuracy, improve the quality of the selected features, and produce clearer results.
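The mRMR-ReliefF algorithm described above builds on the standard ReliefF weighting scheme. As background, here is a minimal Python sketch of conventional ReliefF (the k-nearest-hits/misses formulation); it is not the thesis's modified difference function, and the function name and parameters are illustrative.

    import numpy as np

    def relieff(X, y, n_samples=100, k=10, seed=None):
        # Standard ReliefF feature weights: reward features that separate
        # classes (nearest misses) and penalize those that vary within a
        # class (nearest hits). Assumes numeric features and >= 2 classes.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        rng = np.random.default_rng(seed)
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0)
        span[span == 0] = 1.0                        # guard constant features
        classes, counts = np.unique(y, return_counts=True)
        prior = dict(zip(classes, counts / n))
        m = min(n_samples, n)
        w = np.zeros(d)
        for i in rng.choice(n, size=m, replace=False):
            dist = np.abs(X - X[i]).sum(axis=1)      # L1 distance to all rows
            dist[i] = np.inf                         # exclude the instance itself
            for c in classes:
                idx = np.where(y == c)[0]
                idx = idx[np.argsort(dist[idx])][:k]     # k nearest in class c
                diff = (np.abs(X[idx] - X[i]) / span).mean(axis=0)
                if c == y[i]:
                    w -= diff                        # nearest hits
                else:                                # nearest misses, prior-weighted
                    w += prior[c] / (1.0 - prior[y[i]]) * diff
        return w / m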
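The word-correlation step, which keeps features relevant to the class while discarding redundant ones, follows the max-relevance/min-redundancy (mRMR) idea. The sketch below is one plausible greedy mRMR ordering, using absolute Pearson correlation between feature columns as the redundancy measure in place of mutual information; it illustrates the general technique, not the thesis's exact combination.

    import numpy as np

    def mrmr_select(X, relevance, n_select):
        # Greedy max-relevance / min-redundancy ordering: at each step pick
        # the feature with the best (relevance - mean redundancy to the
        # already-selected set). Assumes non-constant feature columns.
        X = np.asarray(X, dtype=float)
        corr = np.abs(np.corrcoef(X, rowvar=False))  # d x d redundancy matrix
        selected = [int(np.argmax(relevance))]       # seed with the most relevant
        candidates = set(range(X.shape[1])) - set(selected)
        while candidates and len(selected) < n_select:
            best = max(candidates,
                       key=lambda j: relevance[j] - corr[j, selected].mean())
            selected.append(best)
            candidates.remove(best)
        return selected

    # E.g., using ReliefF weights as the relevance term:
    # order = mrmr_select(X, relieff(X, y), n_select=50)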
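The second algorithm starts from the conventional information gain score before adding its refinements. As a reference point, here is a minimal sketch of plain IG over binary term-presence features; the class-balanced sampling, occurrence-probability weighting, and dispersion-based filtering that distinguish TDpIG are not shown.

    import numpy as np

    def information_gain(X, y):
        # IG(t) = H(C) - P(t) * H(C|t) - P(not t) * H(C|not t)
        # over binary term-presence features. Higher scores mean the
        # term tells us more about the class.
        X = (np.asarray(X) > 0).astype(int)
        y = np.asarray(y)
        classes = np.unique(y)

        def entropy(labels):
            if len(labels) == 0:
                return 0.0
            p = np.array([np.mean(labels == c) for c in classes])
            p = p[p > 0]
            return float(-(p * np.log2(p)).sum())

        h_c = entropy(y)
        scores = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            present = X[:, j] == 1
            p_t = present.mean()
            scores[j] = (h_c
                         - p_t * entropy(y[present])
                         - (1 - p_t) * entropy(y[~present]))
        return scores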
Keywords/Search Tags: feature selection, mRMR-ReliefF algorithm, information gain, redundant feature, text classification