Research And Application Of Key Technologies Of Distributed Computing Over Data Streams

Posted on: 2018-12-04, Degree: Master, Type: Thesis
Country: China, Candidate: J J Xiong, Full Text: PDF
GTID: 2348330512481345, Subject: Engineering
Abstract/Summary:
In the era of big data, massive data streams are generated continuously. Extracting valuable information from these streams is an important research topic. Traditional data stream mining techniques are limited to a stand-alone environment and cannot cope with massive data streams. Distributed stream processing platforms such as Spark Streaming are an effective tool for this problem, but data stream mining algorithms that support distributed computing are relatively scarce. The focus of this thesis is how to parallelize traditional algorithms so that they suit a distributed environment. The thesis first introduces the basic concepts of data stream mining and the basic technologies of distributed stream processing platforms, and then proposes parallelization measures for specific algorithms. The main work of this thesis can be summarized as follows:

(1) The parallelization of the density-based data stream clustering algorithm DenStream. DenStream can only run in a stand-alone environment, which limits its throughput, so it cannot cope with massive data streams. According to the survey conducted for this thesis, there is scarcely any published study on the parallelization of DenStream so far. The parallelization method proposed in this thesis focuses on the online micro-cluster maintenance part of DenStream and is based on the concept of micro-batching: the algorithm first collects all the records that arrive within a short period of time and then performs the follow-up steps on that batch. To ensure that the tasks on each node are independent, the global data each node needs for its computation is distributed to the nodes in advance. Experimental results show that the new algorithm keeps the accuracy of DenStream, that the processing speed is greatly improved, and that it can easily be scaled horizontally.

(2) The parallelization of the data stream frequent-pattern mining algorithm FP-Stream. Ignoring pruning, the complexity of FP-Stream is O(2^n) in the average transaction length n. When the average transaction length increases, the throughput of FP-Stream decreases drastically, so parallelizing FP-Stream is necessary. FP-Stream uses the non-pruning FP-Growth algorithm as a sub-process. Based on this feature, the new algorithm uses a special parallelization strategy to parallelize this sub-process. For load balancing, the new algorithm divides the FP-Stream structure into two parts, which are processed in turn. Global data is distributed to each node in each round of processing to ensure independence between nodes. Experimental results show that the processing speed of the new algorithm is greatly improved and that it can easily be scaled horizontally.

(3) The design and implementation of a distributed platform that monitors and predicts the resource usage of hosts in a cluster. In a cluster, a host's resource consumption is often related to its role in the cluster. Based on the two algorithms above, the platform uses data stream mining technologies to analyze host logs in real time and to monitor and forecast the resource consumption of hosts.
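
The micro-batching idea described in item (1) can be illustrated with a minimal sketch. This is not the thesis's algorithm: it omits DenStream's decay factor, radius threshold, and outlier micro-clusters, and it runs the partitions sequentially in plain Python rather than on Spark Streaming. All function names here (`process_partition`, `merge`, `micro_batch_step`) are hypothetical. The sketch only shows the pattern the abstract names: collect a batch, give every partition a read-only snapshot of the global data (here, the micro-cluster centers), compute independent partial results, and merge them.

```python
from math import dist

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def process_partition(records, centers):
    """Per-node task: needs only its own records and the snapshot of centers.
    Returns partial results {cluster index: (count, per-dimension sums)}."""
    partial = {}
    for p in records:
        i = nearest(p, centers)
        n, s = partial.get(i, (0, [0.0] * len(p)))
        partial[i] = (n + 1, [a + b for a, b in zip(s, p)])
    return partial

def merge(partials):
    """Combine the partial results from all partitions into one summary."""
    total = {}
    for part in partials:
        for i, (n, s) in part.items():
            m, t = total.get(i, (0, [0.0] * len(s)))
            total[i] = (m + n, [a + b for a, b in zip(t, s)])
    return total

def micro_batch_step(batch, centers, num_partitions):
    """Process one micro-batch: split it, run independent tasks, merge."""
    parts = [batch[k::num_partitions] for k in range(num_partitions)]
    # On a cluster these tasks would run in parallel; here they run in a loop.
    partials = [process_partition(part, centers) for part in parts]
    return merge(partials)

# Assumed toy data: one micro-batch of 2-D records and two existing centers.
batch = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
summary = micro_batch_step(batch, centers, num_partitions=2)
```

Because each task reads only its own records and the shared snapshot, the merged counts do not depend on how the batch is split, which is what makes it possible to add nodes without changing the per-node logic.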
Keywords/Search Tags: Data stream, Distributed computing, Density-based clustering, Frequent-pattern mining