A Technology Of Text Categorization On Imbalanced Datasets

Posted on: 2010-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
Full Text: PDF
GTID: 2178360302461495
Subject: Computer software and theory
Abstract/Summary:
Organizing and managing massive amounts of information effectively has become a hot research topic, and automatic text classification is a core technology in information retrieval and data mining. In practical applications, however, some categories contain many text samples while others contain only a few, and the texts people care about most are often those that appear rarely but are very important. This category imbalance is a common problem. Traditional approaches usually achieve a low recognition rate on such data, so effectively improving classification performance on the minority categories has become an urgent problem in machine learning and pattern recognition. The work in this paper therefore addresses a challenging pattern recognition problem of great practical importance.

This paper aims to improve categorization performance on the minority categories of imbalanced datasets by re-sampling at the data level. We use random sampling to improve the generalization performance of the classifier on imbalanced datasets: the training texts are first preprocessed by re-sampling, and the classifier is then trained on the processed data. We propose an improved over-sampling method: from texts in the minority categories we extract an arbitrary number of paragraphs and add the extracted paragraphs back to their original categories, thereby synthesizing new minority-category samples. The main idea is to balance the number of texts across categories by adding texts to the minority categories.

Experiments indicate that the system effectively improves the accuracy of text categorization.
Keywords/Search Tags:Text categorization, Imbalanced datasets, Text feature, Classifier
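
The paragraph-based over-sampling idea described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not say how paragraphs are delimited, how many are drawn, or what class size is targeted. The sketch assumes paragraphs are separated by blank lines, draws a random non-empty subset of paragraphs from a randomly chosen minority text, and adds samples until every category matches the largest one. The names `split_paragraphs`, `synthesize`, and `oversample` are hypothetical.

```python
import random
from collections import Counter


def split_paragraphs(text):
    """Split a text into paragraphs (assumption: blank-line delimited)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def synthesize(text, rng):
    """Build a new sample from a random non-empty subset of paragraphs,
    kept in their original order."""
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return text  # nothing to extract; fall back to the original text
    k = rng.randint(1, len(paragraphs))
    chosen = sorted(rng.sample(range(len(paragraphs)), k))
    return "\n\n".join(paragraphs[i] for i in chosen)


def oversample(texts, labels, seed=0):
    """Add synthetic texts to each smaller category until all categories
    have as many texts as the largest one (assumed balancing target)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_texts, new_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            # The synthetic sample keeps the category of its source text.
            new_texts.append(synthesize(rng.choice(pool), rng))
            new_labels.append(label)
    return new_texts, new_labels
```

Only the training set would be re-sampled this way; the classifier is then trained on the returned texts and labels, and evaluation data is left untouched.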