Text Classification Algorithm Based On Imbalanced Data Sets

Posted on:2014-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:N N Xie

GTID:2268330392972275

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computer network technology, electronic documents have gradually become a main form of text information. The diversity and poor organization of network information make it difficult for users to find the exact information they want. Text classification, considered one of the most important technologies in information retrieval, plays a major role in organizing documents. The data sets in standard text-processing corpora are relatively balanced. Text collections in practical applications are different: texts on the network in particular are often incompletely labeled or imbalanced. Because text classification is so widely applied and so important in many fields, data imbalance has become a major problem for the technology, and text classification on imbalanced data sets is becoming a focus of text mining.

This thesis studies text categorization on imbalanced data sets. A new text classification algorithm for imbalanced data sets is proposed, based on improved feature selection and on re-sampling at the data-set level. The main contents are as follows:

① The traditional CHI statistical feature selection method and the one-sided CHI-square metric, which considers only positive features, are studied in depth. However, the experimental results show that both perform poorly.

② Based on an analysis of imbalanced data sets, an improvement to the one-sided CHI-square metric is proposed. A tendentious factor is introduced to preserve some of the negative features that may contribute to classifying the small class. In addition, to overcome the defects of CHI-square, ICF (inverse categorization frequency) is introduced as a factor of the new feature selection method. The new method selects the features that best represent the categories. The texts of the corpus are then quantified in the vector space model.

③ To address the poor classification results caused by imbalanced data, a re-sampling step is applied at the data level after the corpus has been quantified. First, a re-sampling method combining random oversampling and random under-sampling is applied. It reduces the imbalance of the data distribution and yields a relatively balanced data set for training a classifier. However, random oversampling often leads to over-fitting, while random under-sampling cannot avoid deleting some samples that play an important role in classification, which may degrade the classification result. An improved combined re-sampling method is therefore proposed, using SMOTE, which often performs well, for oversampling and an under-sampling method based on an improved clustering algorithm. The experimental results show that the new method produces a better classification result.
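The combined re-sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the "improved clustering algorithm", so plain k-means stands in for it here, and the function names (`smote`, `kmeans_undersample`, `resample`), the parameter `k`, and the 2-D toy vectors are all assumptions made for illustration. In the thesis the inputs would be vector-space representations of texts.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, rng=None):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours.
    Needs at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def kmeans_undersample(majority, n_keep, iters=20, rng=None):
    """Cluster-based undersampling (plain k-means as a stand-in): form
    n_keep clusters and keep one real sample near each centroid."""
    rng = rng or random.Random(0)
    centroids = [list(p) for p in rng.sample(majority, n_keep)]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in majority:
            idx = min(range(n_keep), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    kept, remaining = [], list(majority)
    for c in centroids:
        best = min(remaining, key=lambda p: dist(p, c))
        remaining.remove(best)  # never pick the same sample twice
        kept.append(best)
    return kept


def resample(minority, majority, target):
    """Bring both classes to `target` samples each.
    Assumes len(minority) <= target <= len(majority)."""
    new_minority = minority + smote(minority, target - len(minority))
    new_majority = kmeans_undersample(majority, target)
    return new_minority, new_majority


# Toy data: 5 minority and 40 majority points in 2-D.
data_rng = random.Random(42)
minority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(5)]
majority = [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(40)]
new_min, new_maj = resample(minority, majority, target=15)
print(len(new_min), len(new_maj))  # 15 15
```

Interpolating between neighbours avoids the exact duplicates that random oversampling creates, and keeping one sample per cluster preserves every region of the majority class instead of deleting samples blindly.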

Keywords/Search Tags:

Imbalanced data sets, text classification, CHI-square selection method, data distribution, re-sampling

Related items

1	Research On Classification Method Of Imbalanced Data Sets
2	Complaints Text Classification Research Of Imbalanced Data Sets
3	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
4	Research On Imbalanced Data Sampling Methods For Text Sentiment Classification
5	Research On The Classification Of Imbalanced Data Sets And Related Problems
6	Research On The Classification Algorithm Of Imbalanced Data Sets
7	Research On Classification Method Of High-dimensional Class-imbalanced Data Sets Based On SVM
8	Data Distribution-driven Adaptive Hybrid Sampling Method For Imbalanced Data Processing
9	Research On Binary Imbalanced Large Data Classification And Its Application
10	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets