Research Of Feature Selection Based On Comprehensive Measure In Text Categorization

Posted on: 2016-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: B X Li
Full Text: PDF
GTID: 2348330479953417
Subject: Computer application technology
Abstract/Summary:
Text categorization is a long-standing research topic, and it is applied more and more widely as internet technology develops. Because the traditional text representation is high-dimensional and highly sparse, feature selection is especially significant in text categorization. While studying text categorization, we found that the word frequency feature selection algorithm ignores both inter-category significance and intra-category dispersion, and that the chi-square feature selection algorithm considers only inter-category significance. We therefore propose three feature selection algorithms based on a comprehensive measure: word frequency feature selection based on a balance factor, chi-square feature selection based on a balance factor, and filter-based chi-square feature selection. The first two algorithms address the deficiencies of word frequency and chi-square feature selection by introducing a balance factor that linearly combines inter-category significance with intra-category dispersion; adjusting the balance factor changes the relative contribution of the two measures. Filter-based chi-square feature selection produces an efficient feature subset by excluding, from the result set of traditional chi-square feature selection, the features whose intra-category dispersion is lower than a given threshold. To test and verify the improvement offered by these three feature selection algorithms, we designed and implemented a text categorization system that includes the Multinomial Naive Bayes, Support Vector Machine and k-Nearest Neighbor categorization algorithms. The results show that the three feature selection algorithms are feasible and effective, and that they have better generality.
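
The abstract does not give the formulas behind these measures, so the following Python sketch is only an illustration under stated assumptions. It assumes that intra-category dispersion can be approximated by the fraction of a category's documents that contain the term, that both measures are scaled to [0, 1] before being combined, and that the balance factor `lam` weights them as `lam * inter + (1 - lam) * intra`. The names `counts`, `dispersion`, `balanced_score` and `filtered_chi_square`, as well as the toy corpus, are hypothetical and do not come from the thesis. Only the chi-square statistic itself is the standard 2x2 contingency-table form.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def counts(docs, labels, term, category):
    """Count the four cells (a, b, c, d) of the contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if label == category:
            a, c = a + present, c + (not present)
        else:
            b, d = b + present, d + (not present)
    return a, b, c, d


def dispersion(a, c):
    """ASSUMPTION: dispersion = share of the category's docs containing the term."""
    return a / (a + c) if a + c else 0.0


def balanced_score(inter, intra, lam):
    """ASSUMPTION: linear combination of two measures already scaled to [0, 1]."""
    return lam * inter + (1 - lam) * intra


def filtered_chi_square(docs, labels, category, k, threshold):
    """Take the top-k terms by chi-square, then drop low-dispersion terms."""
    vocab = sorted({term for doc in docs for term in doc})
    scored = []
    for term in vocab:
        a, b, c, d = counts(docs, labels, term, category)
        scored.append((chi_square(a, b, c, d), term, dispersion(a, c)))
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken by term
    return [term for _, term, disp in scored[:k] if disp >= threshold]


# Hypothetical toy corpus: each document is a set of terms.
DOCS = [
    {"ball", "goal", "team"},
    {"ball", "team"},
    {"ball", "referee"},
    {"vote", "party", "team"},
    {"vote", "election"},
    {"vote", "party"},
]
LABELS = ["sport", "sport", "sport", "politics", "politics", "politics"]

selected = filtered_chi_square(DOCS, LABELS, "sport", k=5, threshold=0.3)
```

On this toy corpus, "vote" ties with "ball" on chi-square for the "sport" category because the statistic also rewards strong negative association, yet it never occurs in a sport document. Under the assumed dispersion measure the filter removes it, which is the kind of correction the filter-based variant is described as making.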
Keywords/Search Tags: text categorization, feature selection, inter-category significance, intra-category dispersion, comprehensive measure