Research On Parallelization Of Data Stream Clustering Algorithm For Police Data

Posted on: 2019-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2348330563453984
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, large volumes of streaming data are generated in real time in meteorological prediction, financial transactions, sensor networks and other important areas. Unlike traditional static data, data streams have features such as timeliness and unbounded volume that raise new challenges for big data mining. The primary problem for clustering algorithms is how to cluster high-dimensional data streams quickly and accurately within limited processing time. Therefore, this thesis proposes a two-layer data stream clustering algorithm and an improved parallel algorithm for clustering distributed police data streams. The main work covers the following aspects.

(1) Research on the local-density-based data stream clustering algorithm DCSC. To meet the requirement of fast online processing of noisy high-dimensional data streams, this thesis proposes an online-offline data stream clustering algorithm based on the static clustering algorithm DENCLUE. To handle the concept drift present in streaming data, the hypercubes are extended to attenuation hypercubes. Data stream items are mapped to cubes in the data space to update the summary information; an attenuation function defines the maximum survival time of each cube, so that expired data are eliminated in time and the algorithm can track changes in the data model. The relatively time-consuming offline clustering is combined with an optimized hill-climbing algorithm and snapshot technology, which lets users run the offline phase at any time and obtain clustering results quickly without affecting the online phase. Experiments show that the algorithm adapts to changing data streams and that its online processing speed improves on that of some other two-phase clustering algorithms.

(2) Parallelization of the DCSC algorithm and research on an adaptive parameter selection algorithm. To address the parameter sensitivity of density-based clustering algorithms, an adaptive method for selecting the important parameters is proposed. The principle for deriving the parameters is drawn from the characteristics of DCSC, and the parameters can be obtained automatically from the distribution of the initial data set. DCSC's ability to handle massive streaming data is insufficient because its maximum processing capacity is limited by the configuration of a single node. Therefore, this thesis proposes a parallel algorithm, PDCSC, that improves the hypercube maintenance mechanism and the offline process. In the online phase, an optimized global summarization method combined with the idea of micro-batch processing is proposed, so that all nodes can share global information in time while each processes its own piece of the data separately. Offline data are divided into separate regions; no cluster spans multiple regions, so worker nodes can mine clusters in parallel. Experiments show that the throughput of the algorithm improves effectively as worker nodes are scaled out horizontally, and that the overall processing speed improves markedly.

(3) Design and implementation of the target spatiotemporal trajectory mining system. The system is built on Spark and Kafka clusters, and the PDCSC algorithm is applied to the police streaming data mining scenario. It solves the problem of finding important objects in massive monitoring data streams and provides a data basis for subsequent target behavior detection. According to the user's selection, the system clusters the feature data stream collected from specific monitoring areas. When a cluster center is found to be highly similar to features in the specified library, the temporal and spatial trajectory of the target is compiled and the user is alerted.
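
As a rough illustration of the attenuation hypercube idea in item (1), the sketch below maps stream items to grid cells and lets each cell's density decay over time, so that stale cells can be dropped. The exponential decay form, the class name DecayingGrid, and the parameters cell_width, decay_rate and min_density are assumptions made for illustration; the abstract does not specify DCSC's actual attenuation function or data structures.

```python
import math


class DecayingGrid:
    """Hypothetical online summary: grid cells with exponentially decaying density."""

    def __init__(self, cell_width, decay_rate, min_density):
        self.cell_width = cell_width    # edge length of each hypercube (assumed uniform)
        self.decay_rate = decay_rate    # assumed decay: density halves every 1/decay_rate time units
        self.min_density = min_density  # cells decayed below this are treated as expired
        self.cells = {}                 # cell key -> (density, time of last update)

    def cell_of(self, point):
        # Map a point to the integer coordinates of the hypercube containing it.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def density(self, key, now):
        # Density of a cell as seen at time `now`, after decay since its last update.
        stored, last = self.cells[key]
        return stored * 2.0 ** (-self.decay_rate * (now - last))

    def insert(self, point, t):
        # Decay the cell's old density to time t, then add this item's contribution.
        key = self.cell_of(point)
        old = self.density(key, t) if key in self.cells else 0.0
        self.cells[key] = (old + 1.0, t)

    def prune(self, now):
        # Drop cells whose decayed density has fallen below the threshold.
        expired = [k for k in self.cells if self.density(k, now) < self.min_density]
        for k in expired:
            del self.cells[k]

    def max_survival_time(self):
        # Time for a cell holding a single item to decay below min_density.
        return math.log2(1.0 / self.min_density) / self.decay_rate
```

Under these assumptions, a cell that receives no new items for longer than max_survival_time() after a single insert is removed at the next prune, which is one simple way to bound the lifetime of expired data.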
Keywords/Search Tags: data streams, density-based clustering, parallelization, Spark, police data