Study On Feature Selection Reselected By Term Frequency In Text Classification

Posted on: 2017-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: M He
Full Text: PDF
GTID: 2308330485471030
Subject: Library and Information Science
Abstract/Summary:
Text classification, a data mining technique, is key to handling and organizing large volumes of information. It allows texts to be classified quickly and accurately, which helps organize information effectively and makes it more efficient to use. Text classification therefore has both theoretical significance and high practical value. Its procedure includes text preprocessing, feature selection, classification model training, and performance evaluation. Researchers have studied these procedures, and the problems posed by imbalanced datasets, intensively and have achieved abundant results, but deficiencies still exist. This thesis studies the text classification procedure and several classification algorithms, including Naive Bayes, kNN, and SVM. Feature selection is one of the important areas of text classification: it reduces a very high-dimensional feature space to a feature subset with relatively few dimensions, and this subset helps improve the performance of text classifiers. This thesis also studies several classical feature selection algorithms: Document Frequency (DF), Information Gain (IG), Mutual Information (MI), and the Chi-square statistic (CHI). To enhance the performance of text classifiers on imbalanced datasets, this thesis proposes a new feature selection method based on reselection by term frequency. The main idea is to first use a classical feature selection method to select an initial feature subset, and then to reselect from that initial subset by class-based term frequency. The final feature subset consists of the features selected in the second step. The method is shown to be effective on the Reuters-21578 dataset, using IG, CHI, and MI with Naive Bayes, kNN, and SVM classifiers.
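The two-step idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: the abstract does not say how the class-based term frequency ranking is computed, so the sketch assumes raw term counts per class, keeps the top `per_class` terms for each class, and takes the union. The first stage uses a chi-square score computed from document-level term presence and maximized over classes. All function and parameter names (`chi_square_scores`, `reselect_by_term_frequency`, `two_stage_select`, `k_initial`, `per_class`) are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square statistic over all classes.

    docs is a list of token lists; labels is the parallel list of class names.
    Counts are document-level (a term counts once per document).
    """
    n = len(docs)
    class_size = Counter(labels)
    df = Counter()                     # documents containing the term
    df_in_class = defaultdict(Counter)  # same, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    scores = {}
    for term in df:
        best = 0.0
        for label, size in class_size.items():
            a = df_in_class[label][term]  # in class, has term
            b = df[term] - a              # other classes, has term
            c = size - a                  # in class, lacks term
            d = n - a - b - c             # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def reselect_by_term_frequency(docs, labels, initial, per_class):
    """Assumed second stage: per class, keep the most frequent initial terms."""
    tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in tokens:
            if term in initial:
                tf[label][term] += 1  # raw occurrences within the class

    final = set()
    for counts in tf.values():
        # Sort by descending frequency, breaking ties alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        final.update(term for term, _ in ranked[:per_class])
    return final


def two_stage_select(docs, labels, k_initial, per_class):
    """Stage 1: top-k terms by chi-square. Stage 2: reselect by class TF."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    initial = {term for term, _ in ranked[:k_initial]}
    return reselect_by_term_frequency(docs, labels, initial, per_class)
```

Selecting per class in the second step is one plausible way to keep minority classes represented in the final subset on an imbalanced dataset, since a global ranking would tend to favor terms from the largest classes. IG or MI could replace the chi-square scorer in the first stage without changing the second.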
Keywords/Search Tags:Feature Selection, Text Classification, Imbalanced Dataset, Dimensionality Reduction