Improved CHI Method On Text Feature Selection

Posted on: 2018-12-17  Degree: Master  Type: Thesis
Country: China  Candidate: A Chen  Full Text: PDF
GTID: 2428330572452512  Subject: Software engineering
Abstract/Summary:
Feature selection is fundamental to text classification, so the performance of the feature selection method directly affects classification accuracy. This thesis addresses two weaknesses of the chi-square test (CHI) method. Because CHI does not measure the relevance between words and categories, it discriminates poorly against low-frequency words and weakly correlated words, and the feature sets it produces are redundant; both problems degrade classification. An improved method is proposed. First, the relevance between a word and a category is computed as the product of the probability that the word occurs in documents belonging to that category and the reciprocal of the probability values. Words with strong relevance are selected, which reduces the interference of low-frequency and weakly correlated words in the feature set. Second, the similarity between two feature words is computed from the probability of each word occurring and the probability of the two words occurring together. Feature words that are highly similar to others are pruned, which makes the feature set more representative and better suited to text classification. Finally, a support vector machine (SVM) classifier is used to compare the classification results of the CHI method, the improved CHI method and the information gain (IG) method. The improved chi-square test method achieves better classification results than the original CHI method.
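To make the quantities in the abstract concrete, here is a minimal sketch in Python. It shows the standard chi-square statistic for a word and a category, computed from a 2x2 document-count table, which is the baseline the thesis sets out to improve. It also shows one possible co-occurrence ratio built from the two single-word probabilities and the joint probability. The abstract does not give the thesis's exact formulas, so the function names, the set-of-words document representation and the form of the ratio are assumptions for illustration only, not the author's method.

```python
from typing import List, Set


def chi_square(docs: List[Set[str]], labels: List[str],
               term: str, category: str) -> float:
    """Standard chi-square statistic for a (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is absent or universal
    return n * (a * d - b * c) ** 2 / denom


def cooccurrence_ratio(docs: List[Set[str]], w1: str, w2: str) -> float:
    """Assumed similarity: P(w1, w2) / (P(w1) * P(w2)) over documents.

    This is an illustrative stand-in, not the formula from the thesis.
    """
    n = len(docs)
    if n == 0:
        return 0.0
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum((w1 in d) and (w2 in d) for d in docs) / n
    if p1 == 0 or p2 == 0:
        return 0.0
    return p12 / (p1 * p2)


# Toy corpus (hypothetical data): each document is a set of words.
docs = [{"goal", "match"}, {"goal", "team"},
        {"vote", "party"}, {"vote", "election"}]
labels = ["sport", "sport", "politics", "politics"]

print(chi_square(docs, labels, "goal", "sport"))
print(cooccurrence_ratio(docs, "goal", "match"))
```

In a pipeline of the kind the abstract describes, words would be ranked by a category score, and a word whose similarity to an already selected word exceeds some threshold would be dropped as redundant. The threshold and the ranking rule are not stated in the abstract.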
Keywords/Search Tags:Feature selection, CHI, Similarity calculation, Redundancy reduction, Relevance calculation