Study On Feature Selection Reselected By Term Frequency In Text Classification

Posted on:2017-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:M He

Full Text:PDF

GTID:2308330485471030

Subject:Library and Information Science

Abstract/Summary:

Text classification, a data mining technique, is key to handling and organizing large volumes of information. It allows texts to be categorized quickly and accurately, which organizes information effectively and makes it more efficient to use. Text classification therefore has both theoretical significance and high practical value. The text classification process comprises several steps: text preprocessing, feature selection, classifier training, and performance evaluation. Researchers have studied these steps intensively, along with the problems posed by imbalanced datasets, and have produced abundant results, but deficiencies still exist. This thesis examines the text classification process and several classification algorithms: Naive Bayes, kNN, and SVM. Feature selection is one of the important areas of text classification; it reduces a very high-dimensional feature space to a feature subset with relatively few dimensions, and this subset helps improve the performance of text classifiers. The thesis also examines several classical feature selection algorithms: Document Frequency (DF), Information Gain (IG), Mutual Information (MI), and the chi-square statistic (CHI). To improve classifier performance on imbalanced datasets, the thesis proposes a new feature selection method based on reselection by term frequency. The main idea is to first select an initial feature subset with a classical feature selection method, and then to reselect from that initial subset according to class-based term frequency. The final feature subset consists of the features chosen in the second step. The method's effectiveness is demonstrated on the Reuters-21578 dataset, using IG, CHI, and MI with Naive Bayes, kNN, and SVM classifiers.
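The two-step idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact reselection rule, so everything below is an assumption made for illustration: the first step ranks terms by their maximum per-class chi-square score, and the second step keeps, for each class, the candidate terms with the highest total term frequency in that class. The function names (`chi_square_scores`, `reselect_by_term_frequency`, `two_stage_select`) and the parameters `initial_k` and `per_class` are hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Return {term: highest chi-square statistic over all classes}."""
    n = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()                    # documents containing the term
    class_doc_freq = defaultdict(Counter)   # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = class_doc_freq[cls][term]   # in class, contains term
            b = df - a                      # other classes, contains term
            c = size - a                    # in class, lacks term
            d = n - size - b                # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def reselect_by_term_frequency(docs, labels, candidates, per_class):
    """Assumed second step: keep the most frequent candidates in each class."""
    candidates = set(candidates)
    class_tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_tf[label].update(t for t in tokens if t in candidates)
    selected = set()
    for tf in class_tf.values():
        selected.update(term for term, _ in tf.most_common(per_class))
    return selected


def two_stage_select(docs, labels, initial_k, per_class):
    """Step 1: top `initial_k` terms by chi-square. Step 2: reselect by TF."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    initial = ranked[:initial_k]
    return reselect_by_term_frequency(docs, labels, initial, per_class)
```

Selecting a fixed number of terms per class, rather than one global top list, is one plausible way to keep minority classes represented in the final subset, which is the concern the abstract raises for imbalanced datasets. IG or MI could be substituted for the chi-square scorer in the first step without changing the second.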

Keywords/Search Tags:

Feature Selection, Text Classification, Imbalanced Dataset, Dimensionality Reduction

Related items

1	Dimensionality Reduction On LC-MS Dataset
2	Research On Feature Selection And Weighting Method For Chinese Text Classification
3	Research And Implementation Of Feature Selection In Chinese Text Classification
4	Text Categorization And Feature Dimension Reduction Research
5	Improvement Of KNN And Its Application To Text Classification
6	Research And Implementation Of Feature Selection In Chinese Text Classification
7	Research On High Performance Chinese Text Classification Based On Machine Learning
8	Research On Imbalanced Text Classification
9	Research On Feature Dimensionality Reduction And Text Classification Method Based On Multi-label Learning
10	Research On Text Classification Model And Algorithm For Small Dataset