
Study On The Ensemble Classification Algorithm For Data Streams

Posted on: 2013-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Qian
Full Text: PDF
GTID: 2248330374998141
Subject: Computer application technology
Abstract/Summary:
Data mining can discover useful information in massive data sets, and in recent years it has been widely applied in finance, telecommunications, networking, meteorology, and many other fields. Classification, an important part of data mining, has attracted wide interest from researchers, who have produced a substantial body of work in the area. However, as data stream applications have grown in recent years, classification algorithms designed for traditional static databases cannot adapt to data streams that are fast, rapidly changing, massive, and potentially infinite. Extensive research has shown that ensemble methods, which train multiple classifiers and combine their votes to label the data, improve resistance to noise and concept drift and thereby improve classification accuracy. However, because of the particular demands of data stream processing, the original ensemble algorithms limit classifier performance in terms of efficiency and computational cost. To address this problem, this thesis presents an on-demand ensemble, obtained through serial optimization, and a cloud-computing-based ensemble classification algorithm, obtained through parallel optimization.

To address the high memory and computation consumption of traditional data stream ensemble classification algorithms, this thesis proposes an on-demand ensemble classification algorithm that actively revises the number of classifiers and their weights on demand, reducing cost while maintaining high classification accuracy.
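The abstract does not give the details of the on-demand algorithm, so the following is only a minimal sketch of the general idea: a weighted-vote ensemble that is reweighted on each new chunk and grows only when needed. The function names, the accuracy-based weights, and the parameters `target_acc`, `min_weight`, and `max_size` are all hypothetical choices made for illustration, not the thesis's method.

```python
from collections import defaultdict


def weighted_vote(members, x):
    """Label x by summing the weights of the members voting for each label.

    members is a list of (predict_fn, weight) pairs.
    """
    scores = defaultdict(float)
    for predict, weight in members:
        scores[predict(x)] += weight
    return max(scores, key=scores.get)


def chunk_accuracy(predict, chunk):
    """Fraction of (x, y) pairs in the chunk that predict labels correctly."""
    return sum(predict(x) == y for x, y in chunk) / len(chunk)


def update_on_demand(members, candidate, chunk,
                     target_acc=0.9, min_weight=0.5, max_size=10):
    """Revise ensemble size and weights using the newest labeled chunk.

    Assumed policy (not taken from the thesis):
      1. reweight every member by its accuracy on the chunk,
      2. drop members whose weight falls below min_weight,
      3. add the candidate classifier only if the remaining ensemble
         misses target_acc on the chunk,
      4. keep at most max_size members, highest weight first.
    """
    reweighted = [(p, chunk_accuracy(p, chunk)) for p, _ in members]
    kept = [(p, w) for p, w in reweighted if w >= min_weight]

    if kept:
        acc = chunk_accuracy(lambda x: weighted_vote(kept, x), chunk)
    else:
        acc = 0.0

    if acc < target_acc:
        kept.append((candidate, chunk_accuracy(candidate, chunk)))

    kept.sort(key=lambda member: member[1], reverse=True)
    return kept[:max_size]
```

Under this assumed policy, a stable stream leaves the ensemble unchanged, so no memory is spent on new members, while a drifted chunk evicts stale classifiers and admits the candidate trained on recent data.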
In experiments on two synthetic datasets, the on-demand algorithm improved both classification efficiency and accuracy on data streams with hidden concept drift, while significantly reducing memory consumption.

Cloud computing provides cheap and efficient solutions for analyzing and storing massive data, so the study of cloud-based data stream mining algorithms, among the most challenging areas of massive data mining, has important theoretical value and application prospects. Based on a comprehensive analysis of data stream classification algorithms and the basic theory of cloud computing, this thesis proposes an ensemble classification algorithm for data streams that runs on the Hadoop framework and uses the MapReduce parallel programming model to improve the traditional dynamic weight-based ensemble, thereby speeding up classification. Results show that, on high-speed massive data streams, the algorithm runs much more efficiently than the traditional ensemble algorithm.

In summary, the optimized ensemble algorithms are designed to cope with the special requirements of data streams. They not only retain the advantage of high classification accuracy but also improve classification efficiency and reduce computational overhead, which makes them more practical.
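To make the MapReduce formulation of weighted ensemble voting more concrete, here is a small in-process sketch of how the map and reduce phases could be divided. It does not use the Hadoop API and is not the thesis's implementation: the functions `map_votes`, `shuffle`, `reduce_votes`, and `classify_batch` are hypothetical names, and the assumption is that each map task applies one weighted classifier to a batch of records while each reduce call sums the weights for one record.

```python
from collections import defaultdict


def map_votes(member, records):
    """Map phase: one classifier emits (record_id, (label, weight)) pairs."""
    predict, weight = member
    for record_id, x in records:
        yield record_id, (predict(x), weight)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_votes(votes):
    """Reduce phase: sum weights per label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    return max(totals, key=totals.get)


def classify_batch(members, records):
    """Run the simulated job; records must be a list, since every member
    iterates over it once."""
    pairs = (pair for member in members for pair in map_votes(member, records))
    return {rid: reduce_votes(votes) for rid, votes in shuffle(pairs).items()}


# Usage with three toy threshold classifiers and hand-picked weights.
members = [
    (lambda x: int(x > 0), 0.6),
    (lambda x: int(x > 1), 0.3),
    (lambda x: 1, 0.2),
]
records = [("a", 2.0), ("b", 0.5), ("c", -1.0)]
labels = classify_batch(members, records)
```

Because each map call depends only on one classifier and each reduce call only on one record's votes, both phases could in principle be distributed across workers, which is the property a MapReduce deployment would rely on.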
Keywords/Search Tags: Data streams, ensemble classification, concept drifting, Cloud computing, MapReduce