Research on Real-time Data Streams Clustering Framework

Posted on: 2014-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Li
GTID: 2268330392962830
Subject: Software engineering
Abstract/Summary:
The proliferation of cloud computing, the Internet of Things, and the mobile Internet has resulted in the generation of huge amounts of data. Mining knowledge quickly from this big data, especially from data streams generated in real time, has become an important concern for both research and business. Processing this new style of data raises several technical issues: each data item can be accessed only once, computing resources are limited compared with the amount of data to process, and responses are required in near real time. Thus, making the best use of limited resources to mine data streams quickly is a big challenge. Clustering, which aims to group similar objects into the same category, is a significant research area in data mining. However, traditional clustering frameworks and algorithms cannot easily be applied to data streams, so new frameworks and algorithms are required. At present, some data stream mining frameworks, such as the classic two-component framework in CluStream, provide mining solutions for data streams. Nevertheless, these frameworks pay more attention to the effective synopsis and storage of data streams and do not provide additional clustering capacity under constrained resources.
Identifying the requirements and challenges in data stream mining, this paper proposes a stage-based data streams clustering framework (SRAStream) based on concept drift detection. The objective of this framework is to ensure faster clustering within a certain accuracy under limited resources, i.e., to improve the efficiency of data processing at the cost of an acceptable loss of accuracy. The proposed solution performs the refined clustering when it is triggered within a due time; otherwise, it provides the latest clusters as the result. 
In this way, the amount of repetitive computation can be reduced to improve processing capacity.
The proposed framework consists of four modules: Quick Computing Module, Evolving Detecting Module, Clustering Module, and Resource Monitoring Module. Furthermore, within this framework, this paper also proposes a concept drift detection algorithm that first employs a quick clustering solution to achieve accurate detection and then performs the related detection calculation. The clusters obtained by the quick clustering solution can also be reused for further refined clustering in the Clustering Module, which improves the clustering speed. In addition, this paper constructs and analyses mathematical models of the related modules and compares the proposed solution with CluStream in terms of time complexity and space complexity. Finally, using datasets from UCI and other sources, we conduct accuracy experiments and simulation experiments on data stream processing speed. The experimental results indicate that the proposed framework works well for data stream clustering.
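The stage-based idea described above might be sketched roughly as follows. This is only an illustrative sketch, not the SRAStream implementation from the thesis: the class name `StagedClusterer`, the `drift` score (mean distance from each quick centroid to its nearest current centroid), the `threshold` parameter, and the `budget_ok` callback standing in for resource monitoring are all assumptions made for the example. Plain k-means is used for both the quick stage (one pass) and the refined stage (several passes seeded with the quick centroids), which is a substitution and not necessarily what the thesis uses.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations):
    """Run Lloyd's k-means for a fixed number of iterations from given seeds."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def drift(old, new):
    """Assumed drift score: mean distance from each new centroid to its nearest old one."""
    return sum(math.dist(c, old[nearest(c, old)]) for c in new) / len(new)


class StagedClusterer:
    """Hypothetical two-stage clusterer: quick pass, drift check, optional refinement."""

    def __init__(self, k, threshold, budget_ok=lambda: True):
        self.k = k
        self.threshold = threshold  # assumed drift threshold
        self.budget_ok = budget_ok  # stand-in for resource monitoring
        self.clusters = None        # latest refined centroids
        self.refinements = 0        # how often the refined stage ran

    def process(self, batch):
        # Quick stage: a single k-means pass seeded with the latest clusters.
        seeds = self.clusters or batch[: self.k]
        quick = kmeans(batch, seeds, iterations=1)
        # Drift check: refine only if the concept moved and resources allow it.
        drifted = self.clusters is None or (
            drift(self.clusters, quick) > self.threshold and self.budget_ok()
        )
        if drifted:
            # Refined stage reuses the quick centroids as its starting point.
            self.clusters = kmeans(batch, quick, iterations=10)
            self.refinements += 1
        # Otherwise the latest clusters are returned unchanged.
        return self.clusters
```

Calling `process` on successive batches returns the current centroids each time. In this sketch a batch drawn from the same distribution as the previous one costs only the single quick pass, while a batch whose quick centroids have moved by more than `threshold` triggers the refined stage, provided `budget_ok()` returns true.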
Keywords/Search Tags: Big Data, Data Streams, Clustering Framework, Stage-based Training, Concept Drift