
Hybrid Text Feature Selection Method Based On Word Frequency And Word Position

Posted on: 2021-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Tian
GTID: 2428330620961350
Subject: Software engineering
Abstract/Summary:
Text classification is an important research field in text mining. In text datasets, the number of features is generally much larger than the number of samples, so effective feature selection can greatly improve classification performance. Text feature selection strategies based on the Bag-of-Words model fall mainly into three types: filter, wrapper, and hybrid. Filter methods are adaptable and fast. Wrapper methods perform well but are slow, which makes them difficult to apply to text datasets. Hybrid methods combine the advantages of filter and wrapper methods, giving better classification performance and a shorter running time. This thesis studies a hybrid text feature selection method. Word frequency and word position are applied in the filter stage, and the global discriminant information of words is used to guide the search in the wrapper stage. The main work is as follows:

(1) Ability of Category Representation computing method based on word frequency and word position

Feature selection strategies based on word position usually set the position weights of words by experience, which lacks a factual basis. This thesis proposes a measure of how well a word represents a category, the Ability of Category Representation, which combines word frequency and word position to select features. First, the positions at which all high-weight words occur in the corpus are counted to obtain the position weight factors of words. Then, the word frequencies in documents are summed using these weight factors to obtain the weighted word frequency for each category. Finally, the weighted word frequencies of the different categories are normalized to compute the Ability of Category Representation based on word frequency and word position.

(2) Feature selection method based on Ability of Category Representation and Information Gain

Information Gain considers only whether a feature appears in a text, not the impact of its word frequency. A feature selection method based on Ability of Category Representation and Information Gain is proposed as the filter stage. First, the expression of Information Gain is improved by adding the Ability of Category Representation based on word frequency and word position when considering the change in system entropy. Then, the non-dominated features in the feature set are selected by multi-objective optimization. Finally, the features with high scores among the remaining features are selected. The optimal cut ratio is determined experimentally.

(3) Feature selection method based on Global Discriminant Information and Genetic Algorithm

The traditional genetic algorithm does not consider the category discrimination of individual features during iteration. A feature selection method based on Global Discriminant Information and Genetic Algorithm is proposed as the wrapper stage. First, a selection strategy that retains elites is proposed, drawing on the idea of multi-objective optimization. Then, the fitness function is constructed from the evaluation indicator of text classification and the information of the individual. Finally, the crossover operator is designed based on the global discriminant information of words, replacing random crossover.

(4) Hybrid feature selection method based on word frequency and word position

A hybrid feature selection method based on word frequency and word position is proposed, in which the feature selection method based on Ability of Category Representation and Information Gain serves as the filter stage and the feature selection method based on Global Discriminant Information and Genetic Algorithm serves as the wrapper stage. First, some features are removed in the filter stage according to the optimal cut ratio, and the Global Discriminant Information of the remaining features is retained. Then, the remaining features are passed to the wrapper stage to obtain the feature selection result. Finally, the features of high-weight words in the test documents are extended to enrich the text information.

Experiments show that the hybrid feature selection method based on word frequency and word position combines the advantages of filter and wrapper feature selection methods. It improves text classification performance and provides a new solution for hybrid feature selection.
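
The ideas behind the filter stage can be illustrated with a minimal Python sketch. It is not the thesis's formulation, since the abstract gives no formulas. The position weighting (`position_weight` with its `lead_fraction` and `lead_weight` parameters), the normalized category share used as a stand-in for the Ability of Category Representation, and the toy corpus are all assumptions made for illustration. Information Gain is computed in its standard presence/absence form and kept as a separate objective, whereas the thesis folds the representation score into the Information Gain expression itself.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of term presence: H(C) - H(C | term present or absent)."""
    n = len(docs)
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    h_class = entropy(list(Counter(labels).values()))
    h_cond = sum(
        sum(part.values()) / n * entropy(list(part.values()))
        for part in (present, absent)
    )
    return h_class - h_cond


def position_weight(index, length, lead_fraction=0.2, lead_weight=2.0):
    """Hypothetical rule: tokens in the leading part of a document count more.
    The thesis derives its weight factors from corpus statistics instead."""
    return lead_weight if index < lead_fraction * length else 1.0


def category_representation(docs, labels, term):
    """Illustrative stand-in: the largest normalized share of
    position-weighted term frequency held by any single category."""
    weighted = Counter()
    for tokens, label in zip(docs, labels):
        for i, token in enumerate(tokens):
            if token == term:
                weighted[label] += position_weight(i, len(tokens))
    total = sum(weighted.values())
    return max(weighted.values()) / total if total else 0.0


def non_dominated(scores):
    """Terms whose objective tuple (higher is better) no other term dominates."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )
    return {
        t for t, s in scores.items()
        if not any(dominates(o, s) for o in scores.values())
    }


# Toy corpus (assumed data): each document is a list of tokens.
docs = [
    ["goal", "match", "team", "win"],
    ["team", "goal", "score", "goal"],
    ["market", "stock", "price", "team"],
    ["stock", "market", "trade", "price"],
]
labels = ["sport", "sport", "finance", "finance"]

vocabulary = sorted({token for tokens in docs for token in tokens})
scores = {
    term: (
        information_gain(docs, labels, term),
        category_representation(docs, labels, term),
    )
    for term in vocabulary
}
print(sorted(non_dominated(scores)))
```

In this toy corpus, "team" appears in both categories, so it scores lower on both objectives than a word such as "goal" that occurs in only one category, and it is filtered out. A wrapper stage would then search over the surviving features only.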
Keywords/Search Tags: Word Frequency, Word Position, Information Gain, Genetic Algorithm, Hybrid Feature Selection