Text Classification Based on Data Distribution Characteristics

Posted on: 2012-04-12, Degree: Master, Type: Thesis
Country: China, Candidate: H G Xu, Full Text: PDF
GTID: 2218330368989682, Subject: Computational Mathematics
Abstract/Summary:
With the rapid development of the Internet, it has become an indispensable information medium. The amount of information (on the Internet, in digital libraries, in other information resources, and so on) is increasing exponentially, and a large number of machine learning problems (such as document retrieval, image matching, weather forecasting, intrusion detection and genetic engineering) are emerging. Finding the target information faster and more effectively has therefore become an urgent need. Traditional manual classification can clearly no longer cope with this volume, so automatic classification by computer becomes the best choice.
Text classification is currently an active research area with important research value and practical significance. Although many text categorization methods have been proposed, and some are relatively mature and classify well, practical classification technology is still lacking. Many classification models and feature selection algorithms have high complexity, and their implementation is too complicated, which leads to low efficiency in training and classification. Finding the needed information in a huge data set is a crucial text classification task, and how to improve the accuracy and efficiency (time complexity, space complexity) of text classification needs deeper consideration.
Feature selection methods and classification algorithms are the core of text categorization, and the majority of researchers are dedicated to exploring and improving them. The application of new methods has achieved good results. 
In short, this is a necessary and promising line of research. This thesis studies unbalanced data sets from the perspectives of feature selection and classification, roughly as follows:
(1) A feature selection method based on improved category distribution. We found that a feature selection method based on the Fisher idea should consider both within-class and between-class variance, so that the selected features are better able to distinguish between categories. To achieve this, the author proposes a feature selection method based on an improved category distribution. Experiments show that the improved method is effective.
(2) Using the Drag-Pushing strategy on unbalanced Chinese text. From the classifier's point of view, the aim is to demonstrate that the drag-pushing strategy performs better than SVM and KNN. The thesis first introduces traditional classification methods and points out their problems on unbalanced data sets, then presents drag-pushing to address these problems, and finally runs experiments with IG+drag-pushing, IG+SVM and IG+KNN on unbalanced data sets. The results indicate that IG+drag-pushing performs much better than the other two, which demonstrates the effectiveness of the method.
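As an illustration of the Fisher idea mentioned in item (1), the following is a minimal sketch of the classic Fisher criterion, which scores each feature by the ratio of between-class to within-class variance. It is not the improved category-distribution method proposed in the thesis, whose details the abstract does not give. The function names `fisher_scores` and `top_k_features` and the toy term-frequency matrix are hypothetical, chosen only for this example.

```python
from statistics import fmean


def fisher_scores(X, y):
    """Return one classic Fisher score per feature column.

    score_j = sum_c n_c * (mean_cj - mean_j)^2
              / sum_c sum_{i in c} (x_ij - mean_cj)^2
    X is a list of rows (documents), y the matching class labels.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0  # between-class scatter of feature j
        within = 0.0   # within-class scatter of feature j
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = fmean(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class scatter: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Return the indices of the k features with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical, unbalanced toy data: 4 "sports" documents, 2 "finance" documents.
# Each column is the frequency of one term.
X = [
    [3, 1, 0],
    [4, 0, 1],
    [3, 1, 1],
    [4, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
y = ["sports", "sports", "sports", "sports", "finance", "finance"]

print(fisher_scores(X, y))
print(top_k_features(X, y, 1))
```

In this toy data only the first term has class means that differ, so it receives the highest score; the other two terms have identical means in both classes and score zero.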
Keywords/Search Tags: Unbalanced data set, Feature selection, Text classification, Drag-Pushing algorithms, Machine learning