
Research On Ensemble Classification Algorithms Of Data Stream Based On Concept Drift

Posted on: 2019-05-19 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S Q Ren | Full Text: PDF
GTID: 1368330545973659 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of information technology, computer users continuously collect and share data, which has led to explosive growth in data volume. A data stream is a new type of data. Compared with traditional data, it is characterized by high dimensionality, high speed, dynamics, continuity, and large scale. Data streams are widely used in wireless sensor networks, network traffic monitoring, and financial stock markets. These characteristics make traditional data mining algorithms no longer applicable, and data stream mining has therefore become one of the research focuses of data mining. To handle concept drift, scarce sample labels, class imbalance, and complex data distributions in data streams, this dissertation carries out the following work: (1) research on concept drift detection and data stream classification algorithms in environments with scarce sample labels; (2) research on data stream classification algorithms that handle concept drift and class imbalance; (3) research on data stream classification algorithms that learn complex distributions and multiple kinds of concept drift in imbalanced environments. The main work of this dissertation is summarized as follows:

1) This dissertation describes the background and significance of data mining and data stream mining, as well as the state of research in China and abroad. The basic concepts of data streams and the existing approaches to coping with concept drift are then introduced in detail. Next, traditional data stream ensembles and ensemble classifiers for imbalanced data streams are introduced with particular emphasis. Finally, the metrics used to evaluate data stream classifiers are introduced, and common datasets with concept drift are summarized.

2) Several issues arise in this setting. First, data stream classifiers usually work incrementally, so they can only obtain approximate solutions compared with batch processing. Second, in data stream environments where a large number of sample labels are unavailable, the supervised information is very limited and is not enough to train a classifier with good generalization ability. Third, concept drift detection mechanisms based on the stability of classification performance have long detection delays. Finally, real-world datasets generally combine many types of concept drift, whereas most existing algorithms specialize in only one type of change. To solve these issues, this dissertation first presents a concept drift detection mechanism that discovers concept drift in a timely manner based on both the supervised and the unsupervised information of data streams. Second, this dissertation presents a hybrid ensemble that combines the mechanisms of chunk-based ensembles and online ensembles and can react to multiple kinds of concept drift. The generalization ability of the candidate classifier is improved by leveraging unsupervised information and recurrent concepts of data streams. Finally, the useful information in the data stream is exploited by combining the outputs of all base classifiers.

3) Existing data stream classification algorithms often assume that the class distribution is balanced or approximately balanced. However, the class distribution is imbalanced in many real-world applications. To overcome this issue, this dissertation presents a data stream ensemble classifier that can handle both concept drift and class imbalance. First, a resampling mechanism is presented within the chunk-based framework. By evaluating the similarity between each minority example in past chunks and the current minority examples, this mechanism selects only those past minority examples that are similar to the current concept in order to re-balance the current class distribution. To avoid the influence of outliers and small disjuncts on the similarity evaluation, the current minority set is first clustered, and the similarity is then evaluated by calculating the Mahalanobis distance of a past minority example from the minority class cluster. The similarity between each past minority example and the current majority set is also evaluated to avoid class overlapping. Second, a candidate classifier is built over the amplified data chunk. The examples in the latest block are used to update past base classifiers, which allows the ensemble to adapt to different kinds of concept drift. Finally, the final decision is derived from the classification outputs of all ensemble members, which effectively leverages the information in the data stream.

4) Compared with class imbalance alone, complex distributions (e.g., outliers, disjuncts, and class overlapping) can seriously degrade classification performance and complicate the classification task. However, existing classifiers for imbalanced data streams cannot solve the complex distribution issue. To overcome this issue, this dissertation presents a data stream ensemble classifier that can handle concept drift, class imbalance, and complex data distributions. Within the chunk-based framework, the proposed method first uses the selective resampling mechanism to re-balance the current data distribution. This resampling method avoids absorbing drifting data and complex data into the candidate block. Then, each example in the latest block is assigned an update weight. The past base classifiers are periodically updated so that the ensemble reacts to different kinds of concept drift. In this periodic update procedure, examples with a high misclassification cost and minority examples are assigned high update weights and therefore have a higher probability of being selected to update past classifiers.
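The similarity test used by the resampling mechanism in item 3) can be illustrated with a small sketch. This is not the dissertation's implementation: it assumes the current minority set forms a single cluster (the clustering step is skipped), it omits the check against the current majority set, and the function names, the ridge term, and the distance threshold of 3.0 are all hypothetical choices made for illustration.

```python
from statistics import fmean

def mean_vector(points):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*points)]

def covariance(points, mu, ridge=1e-6):
    """Sample covariance matrix; the small ridge (an assumption) keeps it invertible."""
    n, d = len(points), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[i] - mu[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def mahalanobis(x, mu, cov):
    """Mahalanobis distance sqrt((x - mu)^T cov^{-1} (x - mu))."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff))) ** 0.5

def select_similar(past_minority, current_minority, threshold=3.0):
    """Keep past minority examples that lie close to the current minority cluster."""
    mu = mean_vector(current_minority)
    cov = covariance(current_minority, mu)
    return [p for p in past_minority if mahalanobis(p, mu, cov) <= threshold]
```

For example, with `current_minority = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]]`, a past example at `[0.4, 0.6]` would be kept for re-balancing, while one at `[10, 10]` would be rejected as belonging to a different (possibly drifted) concept.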
Keywords/Search Tags: Data Mining, Data Stream, Data Stream Mining, Ensemble Classifier, Concept Drift, Unlabelled Samples, Concept Drift Detector, Class Imbalance, Complex Distribution