Text Feature Reduction Based On CHI And Feature Clustering

Posted on:2016-07-14

Degree:Master

Type:Thesis

Country:China

Candidate:C F Luo

Full Text:PDF

GTID:2308330479494836

Subject:Software engineering

Abstract/Summary:


With the rapid development of the information age, the volume of electronic information on the Internet keeps growing. How to obtain the information we need quickly and efficiently has become a hot research topic. Text classification and clustering are key technologies in text mining; as techniques for processing large amounts of text data, they can often effectively improve retrieval efficiency. How to select the most useful features from a high-dimensional space is the central question of feature dimension reduction. Feature selection uses an evaluation function or a search algorithm to pick a number of features from the original feature space to form a feature subset.

This paper first describes the current state of feature dimension reduction techniques. It then describes several text mining technologies, which sets the stage for the later chapters. An effective feature dimension reduction method not only reduces the dimension of the feature space but also removes irrelevant features that are useless for classification, and thereby improves the accuracy and efficiency of the classification algorithm. Building on text clustering and combining the advantages of the chi-square statistic, this paper proposes CHIFC, a feature selection method based on semantic clustering of features.

To verify the effectiveness and feasibility of the proposed approach, we use a naive Bayes classifier and a support vector machine classifier on the Sogou corpus and the Chinese Academy of Sciences Institute of Automation corpus to compare the proposed method with traditional statistical methods and the document frequency method. Experimental results show that, when the dimension is reduced by an order of magnitude, the macro F1 value of the proposed method differs only slightly from that of the conventional CHI method and is superior to that of the document frequency method. The results also show that this method can greatly reduce the dimension of the feature space for different types of corpora while maintaining good classification efficiency, which verifies that the proposed method is feasible and effective.
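The chi-square (CHI) statistic mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's CHIFC method, whose clustering step is not described on this page; it only shows the standard CHI score computed from a 2x2 term/class contingency table and a top-k selection. The function names (chi_square, chi_scores, select_top_k) and the choice of taking the maximum score over classes are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one term/class pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_scores(docs, labels):
    """Score each term by its maximum chi-square value over all classes.

    docs is a list of token lists, labels a parallel list of class names.
    Taking the maximum over classes is an assumption; averaging is another
    common choice.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # document frequency per term
    df_in_class = Counter()  # document frequency per (term, class)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for cls, size in class_sizes.items():
            a = df_in_class[(term, cls)]
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Under these assumptions, a term that appears evenly across classes scores zero and is dropped first, while a term concentrated in one class scores high and is kept.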

Keywords/Search Tags:

Feature Clustering, Feature Selection, Text Categorization, χ2 statistics


Related items

1	The Research Of Text Representation And Feature Selection In Text Categorization
2	χ2 Statistics-based Chinese Text Categorization Feature Selection Method
3	Research And Implementation On Web Chinese Text Categorization Technology
4	Theoretical Analysis And Algorithm Study On Feature Selection For Text Categorization
5	Related Technologies Research On Feature Selection For Text Categorization
6	Normal Weight Based Feature Selection Method In SVM Text Categorization
7	An Improved Approach To Feature Selection Of Chinese Text Categorization Based On Correlation Grouping Principle
8	A Study On Key Issues Of Automated Text Categorization For Chinese Documents
9	Feature Selection Methods For Text Categorization
10	Research On Text Categorization And Technologies