Research On Classification Of Data Stream With Recurring Concept Drift

Posted on: 2017-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Feng
GTID: 2348330509455401
Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, data are commonly generated as streams: sensor readings, browsing and purchase records from shopping websites, the constantly changing social networks formed by users of social websites, and so on. Concept drift often occurs in data streams, which makes traditional classification methods unsuitable for them. Concept drift is a challenging problem for data stream mining, and recurring concept drift is one of its subtypes. Because data streams are fast and large, it is rarely possible to obtain a label for every instance in real-world applications, so many instances remain unlabeled. To address recurring concept drift and missing labels, two issues that frequently appear in data stream classification, this thesis makes the following contributions:

(1) When detecting recurring concept drift, it is important to represent concepts well and to select the most appropriate classifier for the current data. To deal with these issues, an algorithm for classifying text data streams with recurring concept drift is proposed. It recognizes recurring concepts by computing the differences in the main features and impact factors between batches of instances. It maintains one classifier per concept and monitors classification accuracy, selecting a classifier according to the Hoeffding inequality in order to adapt better to concept drift. The experimental results show that, on data streams with recurring concept drift, the proposed algorithm achieves better classification accuracy, adapts faster to concept drift, and detects concept drift more accurately than the four other algorithms it was compared with. It is also suitable for classifying data streams without recurring concept drift.

(2) A classification algorithm for partially labeled data streams with recurring concept drift is proposed. The algorithm detects recurring concept drift by monitoring classification accuracy. The detection threshold adjusts automatically according to the classifier's generalization performance, which reduces the risk of wrong judgments and avoids setting the threshold manually. The algorithm labels unlabeled data with a semi-supervised classification method, which increases the amount of labeled data and can thereby improve the generalization performance of the classifiers. To improve labeling accuracy, concept-specific classifiers of historical concepts are introduced to assist the semi-supervised classification. The experimental results show that the algorithm can accurately determine whether two concepts are the same, and can therefore use recurring concepts to respond faster to concept drift and significantly reduce the negative impact of concept drift on classification accuracy. The results also show that using historical classifiers to assist semi-supervised classification significantly improves labeling accuracy and, as a result, greatly improves the generalization performance of the classifiers.
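
Contribution (1) selects among per-concept classifiers according to the Hoeffding inequality, but the abstract does not state the exact test. The following Python sketch shows one common way such a test can be set up, and should not be read as the thesis's own algorithm. The function names, the paired 0/1 correctness representation, and the default `delta` are all assumptions made for illustration. For n observations with range R, the Hoeffding bound is epsilon = R * sqrt(ln(1/delta) / (2n)); here the per-instance difference in correctness lies in [-1, 1], so R = 2.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """One-sided Hoeffding bound: with probability at least 1 - delta, the mean
    of n independent observations spanning `value_range` exceeds its
    expectation by no more than the returned epsilon."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def significantly_better(correct_a, correct_b, delta=0.05):
    """Hypothetical selection rule (an assumption, not taken from the thesis).

    correct_a and correct_b are equal-length sequences of 0/1 flags recording
    whether classifiers A and B were right on the same recent instances.
    Returns True when A's observed accuracy advantage exceeds the bound.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("need two non-empty sequences of equal length")
    mean_diff = (sum(correct_a) - sum(correct_b)) / n
    # Each per-instance difference lies in [-1, 1], so the range is 2.
    return mean_diff > hoeffding_bound(2.0, delta, n)
```

Under these assumptions, a stored classifier would replace the current one only when `significantly_better` returns True on a recent batch, so that small accuracy differences caused by noise do not trigger a switch.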
Keywords/Search Tags: data stream, concept drift, semi-supervised classification, recurring concept drift