
Research On Imbalanced Data Classification In Web Application

Posted on: 2017-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Li
GTID: 1318330536967203
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet, and especially of Internet applications such as online news, e-mail, and e-commerce, gives people convenient access to information, but it can also leave them drowning in a sea of information. Automatic classification of such large-scale data helps people access information more efficiently and make wiser decisions. However, the data in many Internet applications has an imbalanced category distribution: one or more categories contain significantly fewer instances than the others, as with reactionary versus normal news, spam versus normal mail, and abnormal versus normal trading transactions. In such cases, algorithms and evaluation strategies that assume a uniform distribution tend to overlook the minority categories. Yet the minority categories are often the ones of greater concern: network supervision departments want to identify reactionary news, mail service providers want to better identify spam, and e-commerce platforms want to detect abnormal transactions. Imbalance and its related attributes in Internet applications bring many difficulties and challenges, so research on imbalanced data classification in Internet applications has strong practical significance and social value.

Based on the characteristics of various types of Internet applications and the actual needs of the undertaken project, we proceed from simple to complex and design different methods for different kinds of data. First, we focus on two-class imbalanced data classification and propose a new noise filtering algorithm, along with data resampling strategies for the preprocessing stage. We then extend the work to the multi-class setting (more than two classes, where each instance belongs to exactly one class) and propose a decomposition strategy combined with a data resampling method. Next, we extend it to the multi-label setting (unlike multi-class, an instance can belong to multiple classes) and propose a new ensemble learning framework along with a new base learning algorithm. Finally, we take the actual arrival pattern of Internet application data into consideration and propose a multi-window algorithm for imbalanced stream data classification.

(1) For imbalanced two-class classification, we first propose an IPF-based noise filtering algorithm to distinguish minority instances from noise. Then, according to the different characteristics of minority and majority instances, we propose a neighbor-distribution-based oversampling algorithm and a distance-based undersampling algorithm, respectively. To meet actual application requirements, we also design a method that adaptively adjusts the sampling ratio. Finally, extensive experiments on real datasets from different domains show the effectiveness of the proposed method, especially for minority instances.

(2) For imbalanced multi-class data classification, we propose a divide-and-conquer strategy. First, OVA (one-vs-all) decomposition is used to partition the training set, and a series of sub-classifiers is trained. At this stage, every sub-classifier is trained on the entire training set to ensure adaptability. Thereafter, OVO (one-vs-one) decomposition is used to further partition the corresponding candidate instances, and any newly partitioned set that is imbalanced is resampled. More fine-grained sub-classifiers are then trained on the sampled sets. In addition, according to actual application requirements, we design different strategies for nominal and numeric output, respectively. Experiments on several real datasets show that the proposed method can effectively deal with the imbalances that exist in multi-class data.

(3) For imbalanced multi-label data classification, we propose an ensemble learning framework named iMLEL. Based on AdaBoost, iMLEL integrates distribution imbalance into the sub-classifier learning process. In addition, based on the multi-label neural network BPMLL, we propose an improved algorithm designed specifically for imbalanced multi-label data, named iBPMLL. Finally, we integrate iBPMLL into the iMLEL framework as the base classifier and test it on practical application datasets. The results demonstrate the effectiveness of the proposed method.

(4) For imbalanced stream data classification, we take the dynamic characteristics of the data stream and the uncertainty of instance arrival into consideration and propose an ensemble multi-window learning framework named MWEL. MWEL consists of four windows, which store, respectively, the current sliding-window data, the most recent minority instances, the selected sub-classifiers, and the historical window on which each selected sub-classifier was trained. We design a different update strategy for each window. The class label of a newly arriving test instance is determined by weighted majority voting. Experiments on multiple synthetic and real-world datasets show that the proposed method is more efficient.

In summary, this dissertation focuses on imbalanced data classification for different types of data from different Internet applications. A large number of experiments on both synthetic and real-world datasets show the effectiveness and efficiency of the proposed methods. We conclude that the work is significant for both theoretical research on and practical applications of imbalanced data classification in Internet applications.
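
The weighted majority voting step mentioned in (4) can be illustrated with a minimal Python sketch. It is not the dissertation's MWEL implementation: the function name `weighted_majority_vote` and the example weights are hypothetical, and the abstract does not say how MWEL derives the weight of each sub-classifier.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting sub-classifiers carry the most weight.

    predictions: one predicted label per sub-classifier
    weights: one non-negative weight per sub-classifier (hypothetical values;
             the abstract does not specify how MWEL computes them)
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")

    # Sum the weights of all sub-classifiers voting for each label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


# Three sub-classifiers vote on one incoming instance.
label = weighted_majority_vote(["normal", "spam", "spam"], [0.5, 0.3, 0.4])
```

With these illustrative weights the two "spam" votes together outweigh the single "normal" vote, so a minority-class prediction can win even when the strongest individual sub-classifier disagrees.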
Keywords/Search Tags: Web application, imbalanced data, classification, resample, ensemble learning