Classification Of Imbalanced Sample Based On Stream Data

Posted on: 2015-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhao
Full Text: PDF
GTID: 2308330479489719
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, data now takes more and more forms, and streaming data is one of them. Streaming data differs from traditional data in its massive volume, real-time arrival, and dynamic change. In addition, data in real applications is often imbalanced, for example when detecting financial fraud from credit card transaction records or predicting disease from medical check-up data.

To handle imbalanced data, the SMOTE algorithm increases the number of minority class samples: for a minority class point it finds the nearest minority class points and generates new samples by linear interpolation between them. The REA algorithm uses a sliding window to train classifiers over a period of time, addressing the classification of imbalanced samples in stream data and finally producing a classifier. Both algorithms have advantages and disadvantages. SMOTE does not consider how minority class samples are distributed across different areas, and it cannot control the position of the new samples it generates. REA does not deal effectively with concept drift or small disjuncts.

This thesis proposes an improved algorithm, CSMOTE_REA, based on the traditional REA and SMOTE algorithms, to classify imbalanced samples in stream data. The algorithm uses a sampling method with a clustering feature: it first adds historical data to increase the number of minority class points, and then clusters the minority class points to identify the different areas they occupy. The thesis also proposes a grid-based method for generating samples, so that the generated data is strongly related to the existing minority samples and the degree of aggregation within the minority class improves. In addition, the thesis proposes a method that lets test samples choose their classifications by themselves, which improves the predictive capability of the classifier.
In experiments comparing it with other algorithms on many data sets, the proposed algorithm performs better at classifying imbalanced samples in stream data.
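
The SMOTE interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is plain SMOTE as commonly described, not the thesis's CSMOTE_REA method; the function name `smote_sample`, its parameters `k`, `n_new` and `seed`, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by linear interpolation between a
    minority point and one of its k nearest minority neighbours.
    (Hypothetical helper, not code from the thesis.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate: the new point lies on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: three nearby points and one isolated point.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
new_points = smote_sample(minority, k=2, n_new=4)
```

In this toy data the isolated point (5.0, 5.0) still has to pick neighbours from the distant group, so a synthetic sample may land in the empty region between them. That is one way to picture the limitation the abstract attributes to SMOTE, namely that it does not take the distribution of minority samples across different areas into account.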
Keywords/Search Tags: streaming data, imbalanced data, resampling, ensemble learning