Research On Imbalanced Text Classification

Posted on: 2015-02-18 | Degree: Master | Type: Thesis
Country: China | Candidate: H J Yang | Full Text: PDF
GTID: 2298330467963784 | Subject: Computer technology
Abstract/Summary:
With the rapid development of Web 2.0 and other Internet technologies, Internet texts are now largely unrestricted in content and structure. This brings new challenges for text classification, one of which is imbalanced text classification. Imbalanced text refers to text collections in which the number of samples differs greatly between classes. The performance of traditional text categorization methods, especially on minority classes, often deteriorates dramatically when the training set is imbalanced. Yet in many applications of imbalanced text classification, such as identifying illegal web pages or junk mail, categorization performance on the minority classes matters much more than performance on the majority classes.

Based on a study of existing imbalanced text classification methods, this thesis completes the following tasks to address the problems above.

1. The design of an imbalanced text classification method based on synonym expansion. This is a minority-compensating method that functions like data over-sampling. Unlike traditional over-sampling, it concentrates on the clustering representation of the feature space by synonym vectors. Supported by linguistic rules and statistical regularities of synonyms, the method implements feature prediction and feature compensation. The experimental results show that categorization performance improves with this method.

2. The design of a new synonym dictionary generation method based on the thesaurus TongYiCi CiLin, which is developed by the Harbin Institute of Technology Center for Information Retrieval (HIT-CIR). The generation method ensures that the new dictionary adapts to context. At the same time, the method provides precise control over the dictionary dimension.

3. An expansion rule and an expansion judging method are proposed. With the left-side expansion rule and a feature pre-selection method, the boundary decision problem is resolved.

4. The design and implementation of an imbalanced text classification system. In addition to ordinary text classification, the system can also handle imbalanced text. It provides a variety of feature selection methods and classification algorithms, and users can define their own classification strategies through a configuration file.
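
As a rough illustration of the synonym-expansion idea in task 1, the sketch below adds discounted weight for unseen synonyms of the terms that occur in minority-class documents. It is not the thesis's implementation: the toy `SYNONYM_GROUPS` dictionary, the function names, and the `weight` discount are all hypothetical, and the left-side expansion rule and the TongYiCi CiLin-based dictionary generation are not modeled here.

```python
from collections import Counter

# Hypothetical toy synonym dictionary: word -> synonym-group id.
# The thesis derives its dictionary from TongYiCi CiLin; this is only a stand-in.
SYNONYM_GROUPS = {
    "spam": "g_junk",
    "junk": "g_junk",
    "unsolicited": "g_junk",
    "illegal": "g_illegal",
    "unlawful": "g_illegal",
}


def expand_with_synonyms(tokens, groups, weight=0.5):
    """Return term weights for one document, compensating unseen synonyms.

    Each observed term keeps its raw count. Every synonym of an observed
    term that does not occur in the document receives `weight * count`.
    """
    counts = Counter(tokens)
    features = {term: float(c) for term, c in counts.items()}

    # Invert the dictionary: group id -> set of member words.
    members = {}
    for word, gid in groups.items():
        members.setdefault(gid, set()).add(word)

    for term, count in counts.items():
        gid = groups.get(term)
        if gid is None:
            continue  # term has no synonym group, nothing to expand
        for synonym in members[gid]:
            if synonym not in counts:
                features[synonym] = features.get(synonym, 0.0) + weight * count
    return features


def expand_minority(docs, labels, minority_label, groups, weight=0.5):
    """Expand only documents of the minority class; leave the others as raw counts."""
    result = []
    for tokens, label in zip(docs, labels):
        if label == minority_label:
            result.append(expand_with_synonyms(tokens, groups, weight))
        else:
            result.append({t: float(c) for t, c in Counter(tokens).items()})
    return result


if __name__ == "__main__":
    docs = [["junk", "mail"], ["junk", "news"]]
    labels = ["spam", "ham"]
    for features in expand_minority(docs, labels, "spam", SYNONYM_GROUPS):
        print(features)
```

Restricting the expansion to the minority class is what makes it behave like over-sampling: minority documents gain additional non-zero features, while majority documents are left unchanged.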
Keywords/Search Tags: text classification, imbalanced text, synonym dictionary, feature selection, performance evaluation