
Text Feature Reduction Based on CHI and Feature Clustering

Posted on: 2016-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: C F Luo
GTID: 2308330479494836
Subject: Software engineering
Abstract/Summary:
With the rapid development of the information age, the volume of electronic information on the Internet keeps growing, and how to retrieve the information we need quickly and efficiently has become a hot research topic. Text classification and clustering are key technologies in text mining; as techniques for processing large amounts of text data, they can often improve retrieval efficiency. How to select the most useful features from a high-dimensional space is a central question in feature dimension reduction. Feature selection uses an evaluation function or a search algorithm to select a number of features from the original feature space to form a feature subset.

This thesis first describes the current state of feature dimension reduction techniques and then describes several text mining technologies, which sets the stage for the later chapters. An effective feature dimension reduction method not only reduces the dimension of the feature space but also removes irrelevant features that are useless for classification, and thereby improves the accuracy and efficiency of the classification algorithm. Building on text clustering and combining the advantages of the chi-square statistic, this thesis proposes CHIFC, a feature selection method based on semantic clustering of features.

To verify the effectiveness and feasibility of the proposed approach, we use a naive Bayes classifier and a support vector machine classifier on the Sogou corpus and the Chinese Academy of Sciences Institute of Automation corpus to compare the proposed method with traditional statistical methods such as document frequency. The experimental results show that, when the dimension is reduced by an order of magnitude, the macro F1 of the proposed method differs only slightly from that of the conventional CHI method and is superior to that of the document frequency method. The results also show that the method can greatly reduce the dimension of the feature space for different types of corpora while maintaining good classification efficiency, which verifies that the proposed method is feasible and effective.
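
The two baselines named in the abstract, document frequency and the chi-square statistic (CHI), can be illustrated with a minimal sketch. This is not the CHIFC method, whose clustering step the abstract does not specify; it only shows the standard term-category chi-square score computed from a 2x2 contingency table. The toy corpus `docs`, the function names, and the choice to rank each term by its maximum score over categories are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, category label) pairs.
docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "news"], "sports"),
    (["stock", "market", "news"], "finance"),
    (["market", "bank", "stock"], "finance"),
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def chi_square(docs, term, category):
    """Standard chi-square statistic between a term and a category.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate in this corpus
    return n * (a * d - c * b) ** 2 / denom


def select_by_chi(docs, k):
    """Keep the k terms with the highest chi-square score.

    Assumption: a term's score is its maximum over all categories;
    ties are broken alphabetically.
    """
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    categories = sorted({label for _, label in docs})
    scores = {t: max(chi_square(docs, t, c) for c in categories) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]
```

In this toy corpus, "team" appears only in the sports documents and receives a high score, while "news" is spread evenly across both categories and scores zero, even though both terms have the same document frequency. That difference is why a chi-square ranking can discard frequent but uninformative terms that a pure document frequency threshold would keep.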
Keywords/Search Tags: Feature Clustering, Feature Selection, Text Categorization, χ² statistic