Research On Online Learning Of Big Data Based On Concept Drift Detection

Posted on: 2017-05-20  Degree: Master  Type: Thesis
Country: China  Candidate: Y N Li
GTID: 2348330512462133  Subject: Computer software and theory
Abstract/Summary:
With social progress and the continuous development of new technology, information is generated constantly, forming massive data streams. This calls for improving not only the ability to collect and store data, but also the ability to learn from data streams. The information hidden in a data stream is valuable, and mining it with data mining or machine learning algorithms is of great significance.

However, the information or concept implicit in a data stream may change unpredictably over time, which is called concept drift. Concept drift makes data stream classification challenging, and traditional classification algorithms do not adapt well to it. Classifying a data stream with concept drift involves two aspects: detecting the concept drift, and constructing and updating the classification model.

The main subject of this thesis is online classification learning for data streams containing concept drift. A classification model based on incremental SVM and a two-dimensional mechanism for detecting concept drift are proposed. The thesis makes the following contributions:

(1) A two-dimensional mechanism for detecting concept drift is proposed. That is, concept drift is detected along two dimensions: the properties of the data and the classification results. K-means clustering uses the data properties as its similarity criterion, so a K-means-based method detects concept drift from the data properties. The misclassification indicator computed after classifying new data is a random variable that follows a Bernoulli distribution, so a Bernoulli-based method detects concept drift from the classification results. Using both the data properties and the classification results strengthens drift detection. By considering the detection results on both dimensions, the mechanism can distinguish noise from concept drift to some extent, which improves the learning ability of the model by reducing noise interference.

(2) A classification model combining incremental SVM and K-means clustering is proposed. The idea of incremental learning matches the characteristics of data streams, so that learning on a data stream is a gradual process. The outcome of concept drift detection serves as the criterion for updating the K-means clustering model and the classifier. Instances that the K-means clustering model cannot recognize are called the increment; together with the old support vectors (SVs), the increment forms part of the new training set.

(3) Learning on data streams with concept drift is combined with big data techniques on Spark. Almost all machine learning algorithms, including support vector machines and K-means, require many iterations. The MLlib machine learning framework is well suited to iterative computation, and it includes K-means clustering. In addition, the window-based processing mechanism of Spark Streaming can be combined with the classification model proposed in this thesis, which makes the model convenient to implement.

The classification model was verified on artificial and real data sets containing concept drift. Experimental results show that the proposed model achieves better classification results, has a strong ability to detect concept drift, and performs well on noisy data, in line with expectations.
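The two-dimensional detection described in contribution (1) can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the abstract gives no thresholds or combination rule, so everything concrete here is an assumption made for illustration. That covers the warning and drift levels (a commonly used test on the running Bernoulli error rate and its standard deviation), the fixed centroids and radius that stand in for a trained K-means model, and the names ErrorRateMonitor, fraction_unrecognized and combine.

```python
import math


class ErrorRateMonitor:
    """Tracks the misclassification rate as a Bernoulli proportion.

    Assumed rule (not taken from the thesis): signal "warning" or "drift"
    when p + s exceeds the best value seen so far by 2 or 3 standard
    deviations, where s = sqrt(p * (1 - p) / n).
    """

    def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.min_samples = min_samples
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, misclassified):
        """Feed one classification outcome and return the current state."""
        self.n += 1
        self.errors += int(misclassified)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"  # too few samples for a reliable estimate
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # remember the best level so far
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"


def fraction_unrecognized(window, centroids, radius):
    """Share of instances farther than `radius` from every centroid.

    The centroids stand in for a trained K-means model (an assumption).
    """
    if not window:
        return 0.0
    far = sum(
        1 for x in window
        if min(math.dist(x, c) for c in centroids) > radius
    )
    return far / len(window)


def combine(error_state, unrecognized_share, share_threshold=0.5):
    """Hypothetical rule: report drift only when both dimensions agree."""
    data_shift = unrecognized_share > share_threshold
    if error_state == "drift" and data_shift:
        return "concept drift"
    if error_state == "drift" or data_shift:
        return "suspected drift or noise"
    return "stable"
```

In use, each window of the stream would be classified, every outcome fed to ErrorRateMonitor.update, and the window passed to fraction_unrecognized; combine then merges the two signals. Requiring both dimensions to agree is one plausible way to separate isolated noise from a real change in concept, but the rule the thesis actually uses may differ.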
Keywords/Search Tags: Concept Drift, Data Stream, Incremental Learning, Spark, Big Data, SVM