
Text Classification Algorithm Based On Imbalanced Data Sets

Posted on: 2014-02-06 | Degree: Master | Type: Thesis
Country: China | Candidate: N N Xie | Full Text: PDF
GTID: 2268330392972275 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer network technology, electronic documents have gradually become a main form of text information. The diversity and poor organization of network information make it difficult for users to find exactly the information they want. Text classification, which is considered one of the most important technologies in information retrieval, plays a major role in organizing documents. The standard corpora used for text processing are relatively balanced. Text collections in practical applications are different: texts on the network in particular are often incompletely labeled or imbalanced. Because text classification is so widely applied and so important in various fields, data imbalance has become a major problem for the technology, and text classification on imbalanced data sets is becoming a focus of text mining.

This thesis studies text categorization on imbalanced data sets. A new text classification algorithm for imbalanced data sets is proposed, based on improved feature selection and on re-sampling at the data set level. The main contents are as follows:

① The traditional CHI statistical feature selection method and the one-sided CHI-square metric, which considers only positive features, are studied in depth. The experimental results show that both perform poorly.

② Based on an analysis of imbalanced data sets, an improvement to the one-sided CHI-square metric is proposed. A tendentious factor is introduced to preserve part of the negative features, which may contribute to the classification of the small class. In addition, to overcome the defects of CHI-square, the ICF (inverse category frequency) is introduced as a factor in the new feature selection method. The new method selects the features that best represent the categories. The texts of the corpus are then quantified into the vector space model.

③ To address the poor classification results caused by imbalanced data, re-sampling is applied at the data level after the text corpus has been quantified. First, a re-sampling method that combines random oversampling with random under-sampling is applied. Although it reduces the imbalance of the data distribution and yields a relatively balanced data set for training a classifier, random oversampling tends to cause over-fitting, while random under-sampling inevitably deletes some samples that are important for classification, which can degrade the classification result. An improved combined re-sampling method is therefore proposed, which uses SMOTE, a method that usually performs well, for oversampling and an under-sampling method based on an improved clustering algorithm. The experimental results show that the new method produces a better classification result.
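The CHI-square scores discussed in items ① and ② can be illustrated with a minimal Python sketch. It shows only the standard CHI-square statistic, one common reading of the one-sided metric (keep the score only when the term is positively associated with the category), and one common definition of ICF. It does not reproduce the thesis's improved method: the abstract does not give the form of the tendentious factor or how ICF is combined with the score. The function names and the toy `docs` corpus are hypothetical.

```python
import math


def contingency(docs, term, category):
    """Count documents by (term present?, in category?).

    docs is a list of (set_of_terms, label) pairs.
    Returns (a, b, c, d):
      a = in category, contains term    b = other categories, contains term
      c = in category, lacks term       d = other categories, lacks term
    """
    a = b = c = d = 0
    for terms, label in docs:
        has_term = term in terms
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard CHI-square statistic for a 2x2 term/category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category is constant
    return n * (a * d - b * c) ** 2 / denom


def one_sided_chi_square(a, b, c, d):
    """Assumed one-sided variant: score positive features only.

    A feature is 'positive' when a*d - b*c > 0, i.e. the term occurs
    more often inside the category than independence would predict.
    """
    return chi_square(a, b, c, d) if a * d - b * c > 0 else 0.0


def icf(n_categories, n_categories_with_term):
    """Assumed ICF definition: log(|C| / number of categories using the term)."""
    return math.log(n_categories / n_categories_with_term)


# Hypothetical imbalanced toy corpus: 2 "sport" documents, 4 "finance" documents.
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"stock", "market"}, "finance"),
    ({"market", "goal"}, "finance"),
    ({"stock"}, "finance"),
    ({"bank"}, "finance"),
]
```

For the small class "sport", a term such as "stock" is a negative feature: the standard statistic gives it a nonzero score, while the one-sided variant discards it. That discarded information is what item ② proposes to partly preserve.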
Keywords/Search Tags: Imbalanced data sets, text classification, CHI-square selection method, data distribution, re-sampling