Research On Parallel Classification Algorithm Of Streaming Data

Posted on: 2016-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Wei
Full Text: PDF
GTID: 2208330464463535
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the development of mobile internet technology, the volume of global data has grown day by day and we have entered the era of big data. The information storm caused by big data is transforming the way we live, work, and think. Traditional data mining technology also faces great challenges in this era. The biggest of these is the change in the form and structure of data: the data to be processed has shifted from traditional static data to massive dynamic data. Stream data is the most typical form of big data, and its massive volume, real-time arrival, and variation over time greatly increase the complexity of mining algorithms. How to design a classification algorithm that adapts to these features, solves the stream data classification problem effectively, and mines new knowledge has therefore become a hot topic of academic research.

This thesis starts from the characteristics of stream data and focuses on parallelizing the classification of stream data that exhibits concept drift. Concept drift makes classification algorithms inefficient and inaccurate. To address this, and taking the BP neural network as the base classifier, the thesis carries out research in the following three areas:

(1) Based on an analysis of recent research on concept drift and of its characteristics and causes, concept drift is defined, and common detection methods and handling mechanisms are summarized. Then, to meet the real-time requirement of stream data classification, a method is proposed that uses the Euclidean distance to determine whether concept drift has occurred. The retraining and updating mechanisms applied once concept drift is detected are then described.

(2) To solve the problem that a classifier cannot update its model quickly after concept drift is detected, an incremental BP neural network classification algorithm for concept-drifting data streams, named IBPNN-CDCA, is proposed on the basis of incremental learning theory. Through incremental learning, the model retains prior knowledge and dynamically updates the connection weights between neurons, which avoids retraining the classifier model. The BP neural network can therefore adapt quickly to changes in the data.

(3) Considering the massive volume of stream data, and building on research into cluster-based parallel computing, a parallelized IBPNN-CDCA algorithm based on Spark Streaming is proposed. This parallelized algorithm classifies stream data using the computing power of the entire cluster while maintaining high throughput.

In summary, to address the massive volume, real-time arrival, and time variation of stream data, this thesis exploits the fact that parallelization improves data throughput, and proposes and designs an incremental BP neural network classification algorithm for concept-drifting data streams together with its parallelized version. Because incremental online learning can adapt to the concept drift caused by time variation, the two algorithms not only maintain accuracy but also reduce the time spent on model updates and improve classification efficiency. Experimental results show that, compared with other concept drift classification algorithms such as CVFDT, CDRDT and MSRT, the two IBPNN-CDCA algorithms resist drift better and classify more accurately. The work in this thesis provides a new approach to the real-time classification of concept-drifting stream data and offers a useful reference for future research on stream data classification.
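
The abstract does not say which quantities the Euclidean distance is computed between, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes that drift is signalled when the mean feature vector of the latest fixed-size window lies farther than a threshold from the mean of a reference window. The class name `EuclideanDriftDetector` and the parameters `window_size` and `threshold` are hypothetical and chosen for illustration.

```python
import math
from collections import deque


def mean_vector(window):
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    n = len(window)
    return [sum(column) / n for column in zip(*window)]


class EuclideanDriftDetector:
    """Illustrative detector (an assumption, not the thesis's algorithm).

    Samples are grouped into consecutive, non-overlapping windows. The first
    full window fixes a reference mean; each later window is compared with
    that reference using the Euclidean distance.
    """

    def __init__(self, window_size, threshold):
        self.window_size = window_size      # samples per window (assumed)
        self.threshold = threshold          # distance that counts as drift (assumed)
        self.reference = None               # mean vector of the reference window
        self.window = deque(maxlen=window_size)

    def update(self, x):
        """Add one sample; return True if drift is signalled by this sample."""
        self.window.append(x)
        if len(self.window) < self.window_size:
            return False                    # window not full yet

        current = mean_vector(self.window)
        self.window.clear()                 # start collecting the next window

        if self.reference is None:
            self.reference = current        # first full window is the reference
            return False

        drift = math.dist(self.reference, current) > self.threshold
        if drift:
            # Treat the drifted window as the new concept's reference.
            self.reference = current
        return drift
```

With `window_size=3` and `threshold=1.0`, feeding three samples at the origin followed by three samples at (3, 4) signals drift on the sixth sample, since the two window means are a distance of 5 apart. In a setting like the one the abstract describes, a `True` result would be the point at which the classifier's retraining or incremental update is triggered.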
Keywords/Search Tags: BP Neural Network, Classification, Concept Drift, Stream Data, Data Mining